Abstract

We study the distributions of the LASSO, SCAD, and thresholding 
estimators, in finite samples and in the large-sample limit. The asymp- 
totic distributions are derived for both the case where the estimators are 
tuned to perform consistent model selection and for the case where the es- 
timators are tuned to perform conservative model selection. Our findings 
complement those of Knight and Fu (2000) and Fan and Li (2001). We 
show that the distributions are typically highly nonnormal regardless of 
how the estimator is tuned, and that this property persists in large samples.
The uniform convergence rate of these estimators is also obtained,
and is shown to be slower than $n^{-1/2}$ in case the estimator is tuned
to perform consistent model selection. An impossibility result regarding 
estimation of the estimators' distribution function is also provided. 

MSC 2000 subject classification. Primary 62J07, 62J05, 62F11, 62F12, 
62E15. 

Key words and phrases. Penalized maximum likelihood, LASSO, 
SCAD, thresholding, post-model-selection estimator, finite-sample distri- 
bution, asymptotic distribution, oracle property, estimation of distribu- 
tion, uniform consistency. 

1 Introduction 

Penalized maximum likelihood estimators have been studied intensively in the 
last few years. A prominent example is the least absolute selection and shrink- 
age (LASSO) estimator of Tibshirani (1996). Related variants of the LASSO 
include the Bridge estimators studied by Frank and Friedman (1993), least angle
regression (LARS) of Efron, Hastie, Johnstone, and Tibshirani (2004), or the
smoothly clipped absolute deviation (SCAD) estimator of Fan and Li (2001). 
Other estimators that fit into this framework are hard- and soft-thresholding 
estimators. While many properties of penalized maximum likelihood estimators 
are now well understood, the understanding of their distributional properties, 
such as finite-sample and large-sample limit distributions, is still incomplete. 
Probably the most important contribution in this respect is Knight and Fu
(2000) who study the asymptotic distribution of the LASSO estimator (and of 
Bridge estimators more generally) when the tuning parameter governing the in- 
fluence of the penalty term is chosen so that the LASSO acts as a conservative 
model selection procedure (that is, a procedure that does not select
underparameterized models asymptotically, but selects overparameterized models with
positive probability asymptotically); see also Knight (2008). In Knight and Fu 
(2000), the asymptotic distribution is obtained in a fixed-parameter as well as 
in a standard local alternatives setup. This is complemented by a result in Zou 
(2006) who considers the fixed-parameter asymptotic distribution of the LASSO 
when tuned to act as a consistent model selection procedure. Another contribu- 
tion is Fan and Li (2001) who derive the asymptotic distribution of the SCAD 
estimator when the tuning parameter is chosen so that the SCAD estimator 
performs consistent model selection; in particular, they establish the so-called 
'oracle' property for this estimator. The results in that latter paper are also 
fixed-parameter asymptotic results. It is well-known that fixed-parameter (i.e., 
pointwise) asymptotic results can give a wrong picture of the estimators' ac- 
tual behavior, especially when the estimator performs model selection; see, e.g., 
Kabaila (1995), or Leeb and Pötscher (2005, 2008a). Therefore, it is interesting
to take a closer look at the actual distributional properties of such estimators. 

In the present paper we study the finite-sample as well as the asymptotic 
distributions of the hard-thresholding, the LASSO (which coincides with soft- 
thresholding in our context), and the SCAD estimator. We choose a model that 
is simple enough to facilitate an explicit finite-sample analysis that showcases the 
strengths and weaknesses of these estimators in a readily accessible framework. 
Yet, the model considered here is rich enough to demonstrate a variety of phe- 
nomena that will also occur in more complex models. We study both the cases 
where the estimators are tuned to perform conservative model selection as well 
as where the tuning is such that the estimators perform consistent model selec- 
tion. We find that the finite-sample distributions can be decisively non-normal 
(e.g., multimodal). Moreover, we find that a fixed-parameter asymptotic anal- 
ysis gives highly misleading results. In particular, the 'oracle' property, which 
is based on a fixed-parameter asymptotic analysis, is shown to not provide a 
reliable assessment of the estimators' actual performance. For these reasons, we 
also obtain the asymptotic distributions of the estimators mentioned before in a 
general 'moving parameter' asymptotic framework, which better captures essen- 
tial features of the finite-sample distribution. [Interestingly, it turns out that in 
the consistent model selection case a 'moving parameter' asymptotic framework 
more general than the usual $n^{-1/2}$-local asymptotic framework is necessary to
exhibit the full range of possible limiting distributions.] Furthermore, we derive
the uniform convergence rate of the estimators and show that it is slower than
$n^{-1/2}$ in the case where the estimators are tuned to perform consistent model
selection. This again exposes the misleading character of the 'oracle' property. 
We also show that the finite-sample distribution of these estimators cannot
be estimated in any reasonable sense, complementing results of this sort in the
literature (Leeb and Pötscher (2006a,b, 2008b), Pötscher (2006)). In a subsequent
paper, Pötscher and Schneider (2009), analogous results are obtained for
the adaptive LASSO estimator. 

We note that penalized maximum likelihood estimators are intimately related
to more classical post-model-selection estimators. The distributional properties
of the latter estimators have been studied by Sen (1979), Pötscher (1991),
and Leeb and Pötscher (2003, 2005, 2006a,b, 2008b).

The paper is organized as follows: The model and the estimators are introduced
in Section 2, and the model selection probabilities are discussed in
Section 3. Consistency, uniform consistency, and uniform convergence rates of
the estimators are the subject of Section 4. The finite-sample distributions are
derived in Section 5.1, whereas the asymptotic distributions are studied in
Section 5.2. Section 6 provides impossibility results concerning the estimation of
the finite-sample distributions of the estimators, and Section 7 concludes and
summarizes our main findings. The appendix contains results on the asymptotic
distribution in the consistent model selection case when the estimators
are scaled by the inverse of the uniform convergence rate obtained in Section 4
rather than by $n^{1/2}$.

2 The Model and the Estimators 

We start with the orthogonal linear regression model 

$$Y = X\beta + u$$

where $X'X$ is diagonal and the vector $u$ is multivariate normal with mean zero
and variance-covariance matrix $\sigma^2 I$. The multivariate linear model with
orthogonal design occurs in many important settings, including wavelet regression or
the analysis of variance. Because we consider penalized least-squares estimators
with a penalty term that is separable with respect to $\beta$, the resulting
estimators for the components of $\beta$ are mutually independent and each component
estimator is equivalent to the corresponding penalized least squares estimator 
in a univariate Gaussian location model. We therefore restrict attention to this 
simple model in the sequel without loss of generality. 

Suppose $y_1, \ldots, y_n$ are independent and each distributed as
$N(\theta, \sigma^2)$. We assume for simplicity that $\sigma^2$ is known, and hence
we can set $\sigma^2 = 1$ without loss of generality. Apart from the standard
maximum likelihood (least squares) estimator $\bar{y}$ we consider the following
estimators:

1. The hard-thresholding estimator $\hat\theta_H = \bar{y}\,1(|\bar{y}| > \eta_n)$,
where the threshold $\eta_n$ is a positive real number and $1(\cdot)$ denotes the
indicator function. The threshold $\eta_n$ is a tuning parameter set by the user.
The hard-thresholding estimator can be viewed as a penalized least-squares
estimator that arises as the solution to the minimization problem$^1$

$$\sum_{t=1}^{n} (y_t - \theta)^2 + n\left(\eta_n^2 - (|\theta| - \eta_n)^2\, 1(|\theta| < \eta_n)\right).$$

We also note here that for $\eta_n = n^{-1/4}$ the hard-thresholding estimator
is a simple instance of Hodges' estimator (see, e.g., Lehmann and Casella
(1998), pp. 440-443).

2. The soft-thresholding estimator
$\hat\theta_S = \mathrm{sign}(\bar{y})(|\bar{y}| - \eta_n)_+$ with $\eta_n$ as before.
[Here $\mathrm{sign}(x)$ is defined as $-1$, $0$, and $1$ in case $x < 0$, $x = 0$, and $x > 0$,
respectively, and $z_+$ is shorthand for $\max\{z, 0\}$.] That estimator arises as
the solution to the penalized least-squares problem

$$\sum_{t=1}^{n} (y_t - \theta)^2 + 2n\eta_n|\theta|,$$

which shows that $\hat\theta_S$ coincides with the LASSO in the form considered
in Knight and Fu (2000). Note that the tuning parameter in the latter
reference is $\lambda_n = 2n\eta_n$.

3. The SCAD-estimator of Fan and Li (2001) is, in the present context,
given by

$$\hat\theta_{SCAD} = \begin{cases}
\mathrm{sign}(\bar{y})(|\bar{y}| - \eta_n)_+ & \text{if } |\bar{y}| \le 2\eta_n, \\
\{(a - 1)\bar{y} - \mathrm{sign}(\bar{y})\,a\eta_n\}/(a - 2) & \text{if } 2\eta_n < |\bar{y}| \le a\eta_n, \\
\bar{y} & \text{if } |\bar{y}| > a\eta_n,
\end{cases}$$

where $a > 2$ is an additional tuning parameter. This estimator can be
viewed as a simple combination of soft-thresholding for 'small' $|\bar{y}|$ and
hard-thresholding for 'large' $|\bar{y}|$, with a (piecewise) linear interpolation
in between. Alternatively, the estimator can also be obtained as a solution
to a penalized least squares problem; see Fan and Li (2001) for details.
We note that the SCAD-estimator is closely related to the firm shrinkage
estimator of Bruce and Gao (1996).
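
The three estimators in the list above are simple functions of the sample mean. The following is a minimal sketch, not taken from the paper: the function names are illustrative, and the default `a = 3.7` is only the value this paper later uses for its SCAD figure, following Fan and Li (2001).

```python
import math


def hard_threshold(ybar: float, eta: float) -> float:
    """Hard-thresholding: keep ybar if |ybar| > eta, otherwise return 0."""
    return ybar if abs(ybar) > eta else 0.0


def soft_threshold(ybar: float, eta: float) -> float:
    """Soft-thresholding: sign(ybar) * max(|ybar| - eta, 0)."""
    shrunk = abs(ybar) - eta
    if shrunk <= 0.0:
        return 0.0
    return math.copysign(shrunk, ybar)


def scad(ybar: float, eta: float, a: float = 3.7) -> float:
    """SCAD in the Gaussian location model: soft-thresholding for small |ybar|,
    the identity for large |ybar|, and a linear interpolation in between."""
    if a <= 2:
        raise ValueError("SCAD requires a > 2")
    abs_y = abs(ybar)
    if abs_y <= 2 * eta:
        return soft_threshold(ybar, eta)
    if abs_y <= a * eta:
        return ((a - 1) * ybar - math.copysign(a * eta, ybar)) / (a - 2)
    return ybar


if __name__ == "__main__":
    eta = 0.05
    for ybar in (-0.30, -0.15, -0.08, 0.02, 0.08, 0.15, 0.30):
        print(ybar, hard_threshold(ybar, eta), soft_threshold(ybar, eta), scad(ybar, eta))
```

All three functions return zero on the same set of sample means, namely where the absolute value of the mean does not exceed the threshold, so they induce the same model selection rule.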



3 Model Selection Probabilities 

Each of the three estimators discussed above induces a selection between the
restricted model $M_R$ consisting only of the $N(0, 1)$-distribution and the
unrestricted model $M_U = \{N(\theta, 1) : \theta \in \mathbb{R}\}$ in an obvious way, i.e., $M_R$ is selected
if the respective estimator for $\theta$ equals zero, and $M_U$ is selected otherwise. In
the present context, the hard-thresholding estimator $\hat\theta_H$ is furthermore nothing
else than a traditional pre-test estimator that chooses between the unrestricted
maximum likelihood estimator $\hat\theta_U = \bar{y}$ and the restricted maximum likelihood
estimator $\hat\theta_R = 0$ according to the outcome of a t-type test for the hypothesis
$\theta = 0$.

1 The penalty corresponding to hard thresholding given in Fan and Li (2001) differs from
the correct one that we use here, because of a scaling error in equations (2.3) and (2.4) of Fan
and Li (2001).

We now study the model selection probabilities, i.e., the probabilities that
model $M_U$ or $M_R$, respectively, is selected. As they add up to one, it suffices
to consider one of them. First note that the probability of selecting the model
$M_R$ is the same for each of the estimators $\hat\theta_H$, $\hat\theta_S$, and $\hat\theta_{SCAD}$ (provided the
same tuning parameter $\eta_n$ is used). This is so because the events $\{\hat\theta_H = 0\}$,
$\{\hat\theta_S = 0\}$, and $\{\hat\theta_{SCAD} = 0\}$ coincide. Hence,

$$P_{n,\theta}(\hat\theta = 0) = P_{n,\theta}(|\bar{y}| \le \eta_n) = \Pr(|Z + n^{1/2}\theta| \le n^{1/2}\eta_n) = \Phi(n^{1/2}(-\theta + \eta_n)) - \Phi(n^{1/2}(-\theta - \eta_n)), \tag{1}$$

where $\hat\theta$ stands for any of the estimators $\hat\theta_H$, $\hat\theta_S$, and $\hat\theta_{SCAD}$, and where $Z$ is
a standard normal random variable with cumulative distribution function (cdf)
$\Phi$. Here we use $P_{n,\theta}$ to denote the probability governing a sample of size $n$ when
$\theta$ is the true parameter, and $\Pr$ to denote a generic probability measure.
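
The probability of selecting the restricted model is a difference of two normal cdf values, so it can be evaluated directly. The sketch below is illustrative only (the function name is an assumption); it uses the constants that this paper reports for its first figure ($n = 40$, $\theta = 0.16$, $\eta_n = 0.05$), for which the caption states a point-mass weight of 0.15.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cdf


def prob_select_restricted(n: int, theta: float, eta: float) -> float:
    """P(estimator = 0) = Phi(sqrt(n)(-theta + eta)) - Phi(sqrt(n)(-theta - eta))."""
    root_n = sqrt(n)
    return PHI(root_n * (-theta + eta)) - PHI(root_n * (-theta - eta))


if __name__ == "__main__":
    # Constants from the paper's Figure 1 caption.
    print(round(prob_select_restricted(40, 0.16, 0.05), 2))
```

The probability depends on the parameter only through its absolute value, which is why a single formula covers both signs.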

In the following we shall always impose the condition that $\eta_n \to 0$ for
asymptotic considerations, which guarantees that the probability of incorrectly
selecting the restricted model $M_R$ (i.e., selecting $M_R$ if the true $\theta$ is non-zero) vanishes
asymptotically. Conversely, if this probability vanishes asymptotically for every
$\theta \ne 0$, then $\eta_n \to 0$ follows. Therefore, the condition $\eta_n \to 0$ is a basic one,
and without this condition the estimators $\hat\theta_H$, $\hat\theta_S$, and $\hat\theta_{SCAD}$ do not seem to
be of much interest (from an asymptotic viewpoint). As we shall see in the
next section, this basic condition is also equivalent to consistency for $\theta$ of the
hard-thresholding (soft-thresholding, SCAD) estimator.

Given the condition $\eta_n \to 0$, two cases need to be distinguished:
(i) $n^{1/2}\eta_n \to e$, $0 \le e < \infty$, and (ii) $n^{1/2}\eta_n \to e = \infty$.$^2$ In case (i) the
hard-thresholding (soft-thresholding, SCAD) estimator acts as a conservative model selection procedure,
i.e., the probability of selecting the unrestricted model $M_U$ has a positive limit
even when $\theta = 0$, whereas in case (ii) it acts as a consistent model selection
procedure, i.e., this probability vanishes in the limit when $\theta = 0$. This is
immediately seen by inspection of (1). These facts have long been known, see
Bauer, Pötscher, and Hackl (1988).

The results discussed in the preceding paragraph are of a 'pointwise' asymptotic
nature in the sense that the value of $\theta$ is held fixed when sample size $n$ goes
to infinity. As noted before, such pointwise asymptotic results often miss
essential aspects of the finite-sample behavior, especially in the context of model
selection; cf. Leeb and Pötscher (2005). To obtain large-sample results that
better capture finite-sample phenomena, we next present a 'moving parameter'
asymptotic analysis, i.e., we allow $\theta$ to vary with $n$ as $n \to \infty$. The following
result shows in particular that convergence of the model selection probability
to its limit in a pointwise asymptotic analysis is not uniform in $\theta \in \mathbb{R}$ (in fact,
it fails to be uniform in any neighborhood of $\theta = 0$).

2 There is no loss in generality here in the sense that the general case where only $\eta_n \to 0$
holds can always be reduced to case (i) or case (ii) by passing to subsequences.

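Proposition 1 Let $\hat\theta$ be either $\hat\theta_H$, $\hat\theta_S$, or $\hat\theta_{SCAD}$. Suppose that $\eta_n \to 0$ and
$n^{1/2}\eta_n \to e$ with $0 \le e \le \infty$.

(i) Assume $e < \infty$ (corresponding to conservative model selection). Suppose the
true parameter $\theta_n \in \mathbb{R}$ satisfies $n^{1/2}\theta_n \to \nu \in \mathbb{R} \cup \{-\infty, \infty\}$. Then

$$\lim_{n \to \infty} P_{n,\theta_n}(\hat\theta = 0) = \Phi(-\nu + e) - \Phi(-\nu - e).$$

(ii) Assume $e = \infty$ (corresponding to consistent model selection). Suppose
$\theta_n \in \mathbb{R}$ satisfies $\theta_n/\eta_n \to \zeta \in \mathbb{R} \cup \{-\infty, \infty\}$. Then

1. $|\zeta| < 1$ implies $\lim_{n \to \infty} P_{n,\theta_n}(\hat\theta = 0) = 1$;

2. $|\zeta| = 1$ and $n^{1/2}(\eta_n - \zeta\theta_n) \to r$ for some $r \in \mathbb{R} \cup \{-\infty, \infty\}$ implies
$\lim_{n \to \infty} P_{n,\theta_n}(\hat\theta = 0) = \Phi(r)$;

3. $|\zeta| > 1$ implies $\lim_{n \to \infty} P_{n,\theta_n}(\hat\theta = 0) = 0$.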

Proof. The proof of part (i) is immediate from (1). To prove part (ii) we use
(1) to rewrite $P_{n,\theta_n}(\hat\theta = 0)$ as

$$P_{n,\theta_n}(\hat\theta = 0) = \Phi(n^{1/2}\eta_n(1 - \theta_n/\eta_n)) - \Phi(n^{1/2}\eta_n(-1 - \theta_n/\eta_n)).$$

The first and the third claim follow immediately from this. For the second
claim, assume first that $\zeta = 1$. Then $\Phi(n^{1/2}\eta_n(1 - \theta_n/\eta_n)) = \Phi(n^{1/2}(\eta_n - \theta_n))$
obviously converges to $\Phi(r)$, whereas $\Phi(n^{1/2}\eta_n(-1 - \theta_n/\eta_n))$ converges to zero.
The case $\zeta = -1$ is handled similarly. ■
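
To see the three regimes of part (ii) numerically, one can evaluate the selection probability along moving parameter sequences. The sketch below is only an illustration under stated assumptions: the tuning sequence $\eta_n = n^{-1/4}$ is one hypothetical choice with $n^{1/2}\eta_n \to \infty$, the parameter is set to $\theta_n = \zeta\eta_n$, and the function names are invented for this example.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cdf


def prob_zero(n: int, theta_n: float, eta_n: float) -> float:
    """Probability that the thresholding estimator equals zero at sample size n."""
    root_n = n ** 0.5
    return PHI(root_n * (eta_n - theta_n)) - PHI(root_n * (-eta_n - theta_n))


def consistent_eta(n: int) -> float:
    """Hypothetical consistent tuning: eta_n -> 0 while sqrt(n) * eta_n -> infinity."""
    return n ** -0.25


if __name__ == "__main__":
    for zeta in (0.5, 1.0, 1.5):
        for n in (10**2, 10**4, 10**6, 10**8):
            eta = consistent_eta(n)
            print(f"zeta={zeta} n={n:>9} P(estimator = 0)={prob_zero(n, zeta * eta, eta):.6f}")
```

With $\zeta = 1$ the quantity $n^{1/2}(\eta_n - \theta_n)$ is identically zero in this setup, so the limit is $\Phi(0) = 1/2$; values of $\zeta$ below and above one drive the probability to one and zero, respectively.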

Proposition 1 in fact completely describes the large-sample behavior of the
model selection probability without any conditions on the parameter $\theta$, in
the sense that all possible accumulation points of the model selection probability
along arbitrary sequences of $\theta_n$ can be obtained in the following manner:
Just apply the result to subsequences and note that, by compactness of
$\mathbb{R} \cup \{-\infty, \infty\}$, we can select from each subsequence a further subsequence such
that all relevant quantities such as $n^{1/2}\theta_n$, $\theta_n/\eta_n$, $n^{1/2}(\eta_n - \theta_n)$, or $n^{1/2}(\eta_n + \theta_n)$
converge in $\mathbb{R} \cup \{-\infty, \infty\}$ along this further subsequence.

In the conservative model selection case we see from Proposition 1 that the
usual local alternative parameter sequences describe the asymptotic behavior.
In particular, if $\theta_n$ is local to $\theta = 0$ in the sense that $\theta_n = \nu/n^{1/2}$, the local
alternatives parameter $\nu$ governs the limiting model selection probability.
Deviations of $\theta_n$ from $\theta = 0$ of order $1/n^{1/2}$ are detected with positive probability
asymptotically and deviations of larger order are detected with probability one
asymptotically in this case. In the consistent model selection case, however, a
different picture emerges. Here, Proposition 1 shows that local deviations of $\theta_n$
from $\theta = 0$ that are of the order $1/n^{1/2}$ are not detected by the model selection
procedures at all.$^3$ In fact, even larger deviations of $\theta_n$ from zero go asymptotically
unnoticed by the model selection procedure, namely as long as $\theta_n/\eta_n \to \zeta$ with
$|\zeta| < 1$. [Note that these larger deviations would be picked up by a conservative
procedure with probability one asymptotically.] This unpleasant consequence
of model selection consistency has a number of repercussions as we shall see
later on. For a more detailed discussion of these phenomena in the context of
post-model-selection estimators see Leeb and Pötscher (2005).

The speed of convergence of the model selection probability to its limit in
part (i) of the proposition is governed by the slower of the convergence speeds
of $n^{1/2}\eta_n$ and $n^{1/2}\theta_n$. In part (ii) it is exponential in $n^{1/2}\eta_n$ in cases 1 and 3,
and is governed by the convergence speed of $n^{1/2}\eta_n$ and $n^{1/2}(\eta_n - \zeta\theta_n)$ in case 2.



4 Consistency, uniform consistency, and uniform convergence rate of $\hat\theta_H$, $\hat\theta_S$, and $\hat\theta_{SCAD}$

It is easy to see that the basic condition $\eta_n \to 0$ discussed in the preceding
section is in fact also equivalent to consistency of $\hat\theta_H$ for $\theta$, i.e., to

$$\lim_{n \to \infty} P_{n,\theta}(|\hat\theta_H - \theta| > \varepsilon) = 0 \quad \text{for every } \varepsilon > 0 \text{ and every } \theta \in \mathbb{R}.$$

The same is also true for $\hat\theta_S$ and $\hat\theta_{SCAD}$, as is elementary to verify. [At least
the sufficiency parts are well-known, see Pötscher (1991) for hard-thresholding,
Knight and Fu (2000) for soft-thresholding,$^4$ and Fan and Li (2001) for SCAD.]
In fact, under this basic condition on $\eta_n$, the estimators are even uniformly
consistent with a certain rate as we show next:



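Theorem 2 Assume $\eta_n \to 0$. Let $\hat\theta$ stand for either $\hat\theta_H$, $\hat\theta_S$, or $\hat\theta_{SCAD}$. Then
$\hat\theta$ is uniformly consistent, i.e.,

$$\lim_{n \to \infty} \sup_{\theta \in \mathbb{R}} P_{n,\theta}(|\hat\theta - \theta| > \varepsilon) = 0 \quad \text{for every } \varepsilon > 0.$$

In fact, the supremum in the above expression converges to zero exponentially
fast for every $\varepsilon > 0$. Furthermore, set $a_n = \min\{n^{1/2}, \eta_n^{-1}\}$. Then for every
$\varepsilon > 0$ there exists a (nonnegative) real number $M$ such that

$$\sup_{n \in \mathbb{N}} \sup_{\theta \in \mathbb{R}} P_{n,\theta}(a_n|\hat\theta - \theta| > M) < \varepsilon$$

holds. In particular, $\hat\theta$ is uniformly $\min\{n^{1/2}, \eta_n^{-1}\}$-consistent.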



3 For such deviations this also immediately follows from a contiguity argument. 
4 Knight and Fu (2000) consider the LASSO-estimator in a linear regression model without 
an intercept, hence their result does not directly apply to the case considered here. 
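Proof. We begin with proving uniform consistency of $\hat\theta_H$. Observe that
$\sup_{\theta \in \mathbb{R}} P_{n,\theta}(|\hat\theta_H - \theta| > \varepsilon)$ can be written as

$$\begin{aligned}
&\sup_{\theta \in \mathbb{R}} P_{n,\theta}\left(\left|(\bar{y} - \theta)\,1(|\bar{y}| > \eta_n) - \theta\,1(|\bar{y}| \le \eta_n)\right| > \varepsilon\right) \\
&\quad \le \sup_{\theta \in \mathbb{R}} P_{n,\theta}(|\bar{y} - \theta| > \varepsilon/2,\ |\bar{y}| > \eta_n) + \sup_{\theta \in \mathbb{R}} P_{n,\theta}(|\theta| > \varepsilon/2,\ |\bar{y}| \le \eta_n) \\
&\quad \le \Pr(|Z| > n^{1/2}\varepsilon/2) + \sup_{|\theta| > \varepsilon/2} P_{n,\theta}(|\bar{y}| \le \eta_n),
\end{aligned}$$

where $Z$ is standard normally distributed. Now the first term on the far r.h.s.
in the above display obviously converges to zero exponentially fast as $n \to \infty$.
In the second term on the far right, the probability gets large as $|\theta|$ gets close
to $\varepsilon/2$. Therefore, the second term on the far r.h.s. equals

$$\Pr(|Z + n^{1/2}\varepsilon/2| \le n^{1/2}\eta_n) = \Phi(n^{1/2}(-\varepsilon/2 + \eta_n)) - \Phi(n^{1/2}(-\varepsilon/2 - \eta_n))$$

and also goes to zero exponentially fast because $\eta_n \to 0$.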

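Next, for the soft-thresholding estimator, observe that we have the relation

$$\hat\theta_S = \hat\theta_H - \mathrm{sign}(\hat\theta_H)\,\eta_n. \tag{2}$$

Consequently,

$$\sup_{\theta \in \mathbb{R}} P_{n,\theta}(|\hat\theta_H - \hat\theta_S| > \varepsilon) \le \sup_{\theta \in \mathbb{R}} P_{n,\theta}(\eta_n > \varepsilon) = 1(\eta_n > \varepsilon),$$

which equals zero for sufficiently large $n$. Hence, the results established so far
for $\hat\theta_H$ carry over to $\hat\theta_S$.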

For the SCAD estimator observe that it is 'sandwiched' between the other
two in the sense that

$$\hat\theta_S \le \hat\theta_{SCAD} \le \hat\theta_H \tag{3}$$

holds if $\hat\theta_S \ge 0$, and that the order is reversed if $\hat\theta_S \le 0$. This entails the
corresponding result for the SCAD estimator.

We next prove uniform $a_n$-consistency of $\hat\theta_H$. Repeating the arguments
from the beginning of the proof with $M/a_n$ replacing $\varepsilon$, we see that
$\sup_{\theta \in \mathbb{R}} P_{n,\theta}(a_n|\hat\theta_H - \theta| > M)$ is bounded from above by

$$\Pr(|Z| > n^{1/2}M/(2a_n)) + \Pr(|Z + n^{1/2}M/(2a_n)| \le n^{1/2}\eta_n).$$

Because $n^{1/2}/a_n \ge 1$, the first term on the right-hand side of the above expression
is not larger than $\Pr(|Z| > M/2)$. The second term equals

$$\begin{aligned}
&\Phi(-n^{1/2}M/(2a_n) + n^{1/2}\eta_n) - \Phi(-n^{1/2}M/(2a_n) - n^{1/2}\eta_n) \\
&\quad = \Phi((n^{1/2}/a_n)(-M/2 + a_n\eta_n)) - \Phi((n^{1/2}/a_n)(-M/2 - a_n\eta_n)).
\end{aligned}$$

Note that $n^{1/2}/a_n \ge 1$ and $a_n\eta_n \le 1$. For $M > 2$, the expression in the above
display is therefore not larger than $\Phi(-M/2 + 1)$. Uniform $a_n$-consistency of
$\hat\theta_H$ follows from this. The proof for $\hat\theta_S$ and $\hat\theta_{SCAD}$ is then similar as before. ■

For the case where the estimators $\hat\theta_H$, $\hat\theta_S$, and $\hat\theta_{SCAD}$ are tuned to perform
conservative model selection, the preceding theorem shows that these estimators
are uniformly $n^{1/2}$-consistent. In contrast, in case the estimators are tuned to
perform consistent model selection, the theorem only guarantees uniform
$\eta_n^{-1}$-consistency; that the estimators actually do not converge faster than $\eta_n$ in a
uniform sense in this case will be shown in Section 5.2.2.
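
The contrast between the two scalings can be checked numerically for the hard-thresholding estimator, whose tail probability is available in closed form: the estimation error exceeds $t$ either when the sample mean is selected and lies farther than $t$ from the parameter, or when zero is selected and the parameter itself is farther than $t$ from zero. The sketch below is illustrative only; the function names, the grid search over the parameter (which yields a lower bound on the supremum), and the tuning $\eta_n = n^{-1/4}$ are assumptions made for this example.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cdf


def interval_prob(lo: float, hi: float, theta: float, n: int) -> float:
    """P(lo <= ybar <= hi) for ybar ~ N(theta, 1/n); zero for an empty interval."""
    if hi <= lo:
        return 0.0
    root_n = sqrt(n)
    return PHI(root_n * (hi - theta)) - PHI(root_n * (lo - theta))


def tail_prob_hard(n: int, theta: float, eta: float, t: float) -> float:
    """Exact P(|hard-thresholding estimate - theta| > t)."""
    p_near = interval_prob(theta - t, theta + t, theta, n)   # |ybar - theta| <= t
    p_zero = interval_prob(-eta, eta, theta, n)              # estimate equals 0
    p_both = interval_prob(max(theta - t, -eta), min(theta + t, eta), theta, n)
    # Estimate equals ybar, and ybar is farther than t from theta.
    p_selected_and_far = max(0.0, 1.0 - (p_near + p_zero - p_both))
    # Estimate equals 0, and theta itself is farther than t from 0.
    p_zero_and_far = p_zero if abs(theta) > t else 0.0
    return p_selected_and_far + p_zero_and_far


def worst_case(n, eta, scale, m, grid_step=0.001, grid_max=2.0):
    """Grid approximation (a lower bound) of sup over theta of P(scale*|est - theta| > m)."""
    steps = int(round(grid_max / grid_step))
    return max(tail_prob_hard(n, k * grid_step, eta, m / scale) for k in range(steps + 1))


if __name__ == "__main__":
    n = 10**4
    eta = n ** -0.25  # hypothetical consistent tuning: sqrt(n) * eta = 10
    print("scaled by sqrt(n):", worst_case(n, eta, sqrt(n), 6.0))
    print("scaled by 1/eta_n:", worst_case(n, eta, 1.0 / eta, 6.0))
```

In this configuration the worst-case probability under the $n^{1/2}$ scaling is close to one, whereas under the $\eta_n^{-1}$ scaling it stays below the bound $\Pr(|Z| > M/2) + \Phi(-M/2 + 1)$ from the proof.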

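Remark 3 Let $\hat\theta$ denote any one of the estimators $\hat\theta_H$, $\hat\theta_S$, or $\hat\theta_{SCAD}$. In case
$n^{1/2}\eta_n \to e = 0$ it is easy to see that $\hat\theta$ is uniformly asymptotically equivalent to
$\hat\theta_U = \bar{y}$ in the sense that $\lim_{n \to \infty} \sup_{\theta \in \mathbb{R}} P_{n,\theta}(n^{1/2}|\hat\theta - \bar{y}| > \varepsilon) = 0$ for every
$\varepsilon > 0$. [For $\hat\theta = \hat\theta_H$, this follows easily from Proposition 1, for $\hat\theta = \hat\theta_S$ it follows
then from (2), and for $\hat\theta = \hat\theta_{SCAD}$ from (3).]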

5 The distributions of $\hat\theta_H$, $\hat\theta_S$, and $\hat\theta_{SCAD}$

5.1 Finite-sample distributions 

For purposes of comparison we note the obvious fact that the distribution of the
unrestricted maximum likelihood estimator $\hat\theta_U = \bar{y}$ (corresponding to model
$M_U$) as well as the distribution of the restricted maximum likelihood estimator
$\hat\theta_R = 0$ (corresponding to model $M_R$) are normal; more precisely, $n^{1/2}(\hat\theta_U - \theta)$
is $N(0, 1)$-distributed and $n^{1/2}(\hat\theta_R - \theta)$ is $N(-n^{1/2}\theta, 0)$-distributed, where the
singular normal distribution is to be interpreted as pointmass at $-n^{1/2}\theta$. For the
hard-thresholding estimator, the finite-sample distribution $F_{H,n,\theta}$ of $n^{1/2}(\hat\theta_H - \theta)$
is of the form
$$dF_{H,n,\theta}(x) = \{\Phi(n^{1/2}(-\theta + \eta_n)) - \Phi(n^{1/2}(-\theta - \eta_n))\}\, d\delta_{-n^{1/2}\theta}(x) + \phi(x)\, 1(|x + n^{1/2}\theta| > n^{1/2}\eta_n)\, dx, \tag{4}$$

where $\delta_z$ denotes pointmass at $z$ and $\phi$ denotes the standard normal density.
Relation (4) is most easily obtained by writing $P_{n,\theta}(n^{1/2}(\hat\theta_H - \theta) \le x)$ as the sum
of $P_{n,\theta}(n^{1/2}(\hat\theta_H - \theta) \le x, \hat\theta_H = 0)$ and $P_{n,\theta}(n^{1/2}(\hat\theta_H - \theta) \le x, \hat\theta_H \ne 0)$. This also
shows that the two terms in (4) correspond to the distribution of $n^{1/2}(\hat\theta_H - \theta)$
conditional on the events $\{\hat\theta_H = 0\}$ and $\{\hat\theta_H \ne 0\}$, respectively, multiplied by
the probability of the respective events. Relation (4) also follows as a special
case of Leeb and Pötscher (2003), which provides the finite-sample as well as
the asymptotic distributions of a general class of post-model-selection estimators.
We recognize that the distribution of the hard-thresholding estimator is
a mixture of two components: The first one is a singular normal distribution
(i.e., pointmass) and coincides with the distribution of the restricted maximum
likelihood estimator. The second one is absolutely continuous and represents an
'excised' version of the normal distribution of the unrestricted maximum likelihood
estimator. Note that the absolutely continuous part in (4) is bimodal and
hence is distinctly non-normal. The shape of the distribution of $n^{1/2}(\hat\theta_H - \theta)$ is
exemplified in Figure 1.



[Figure 1: panel "Hard-Thresholding"; horizontal axis $x$ from $-3$ to $3$.]

Figure 1: Distribution of $n^{1/2}(\hat\theta_H - \theta)$ for $n = 40$, $\theta = 0.16$, and
$\eta_n = 0.05$. The density of the absolutely continuous part is shown by
the solid curve, which is discontinuous at $x = n^{1/2}(-\theta - \eta_n)$ and
$x = n^{1/2}(-\theta + \eta_n)$. [For better readability, the left- and right-hand limits
at discontinuity points are joined by line segments.] The vertical
dotted line indicates the location of the point-mass at $-n^{1/2}\theta$; the
weight of the point-mass, i.e., the multiplier of $d\delta_{-n^{1/2}\theta}(x)$ in (4),
equals 0.15. For other values of the constants involved here, a similar
picture is obtained.
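
The two components of the hard-thresholding distribution can be evaluated directly. The following sketch (function names are illustrative, and the midpoint rule is just one simple way to integrate numerically) computes the point-mass weight and the absolutely continuous density for the constants in the Figure 1 caption and confirms that the two parts together have total mass one.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal: STD.cdf and STD.pdf


def atom_weight(n: int, theta: float, eta: float) -> float:
    """Weight of the point-mass at -sqrt(n)*theta."""
    r = sqrt(n)
    return STD.cdf(r * (-theta + eta)) - STD.cdf(r * (-theta - eta))


def ac_density_hard(x: float, n: int, theta: float, eta: float) -> float:
    """Density of the absolutely continuous part: a normal density with a gap excised."""
    r = sqrt(n)
    return STD.pdf(x) if abs(x + r * theta) > r * eta else 0.0


def midpoint_integral(f, lo: float, hi: float, steps: int) -> float:
    """Plain midpoint rule; adequate here despite the two jump discontinuities."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))


if __name__ == "__main__":
    n, theta, eta = 40, 0.16, 0.05  # constants from the Figure 1 caption
    weight = atom_weight(n, theta, eta)
    ac_mass = midpoint_integral(lambda x: ac_density_hard(x, n, theta, eta), -10.0, 10.0, 40_000)
    print("point-mass weight:", round(weight, 4))
    print("absolutely continuous mass:", round(ac_mass, 4))
    print("total:", round(weight + ac_mass, 4))
```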

The finite-sample distribution $F_{S,n,\theta}$ of $n^{1/2}(\hat\theta_S - \theta)$ is given by

$$\begin{aligned}
dF_{S,n,\theta}(x) = {} & \{\Phi(n^{1/2}(-\theta + \eta_n)) - \Phi(n^{1/2}(-\theta - \eta_n))\}\, d\delta_{-n^{1/2}\theta}(x) \\
& + \phi(x - n^{1/2}\eta_n)\, 1(x + n^{1/2}\theta < 0)\, dx \\
& + \phi(x + n^{1/2}\eta_n)\, 1(x + n^{1/2}\theta > 0)\, dx.
\end{aligned} \tag{5}$$

For later use we note that this implies

$$F_{S,n,\theta}(x) = \Phi(x + n^{1/2}\eta_n)\, 1(x \ge -n^{1/2}\theta) + \Phi(x - n^{1/2}\eta_n)\, 1(x < -n^{1/2}\theta). \tag{6}$$
Relation (5) is obtained from a derivation similar to that of (4), namely by
representing $P_{n,\theta}(n^{1/2}(\hat\theta_S - \theta) \le x)$ as the sum of $P_{n,\theta}(n^{1/2}(\hat\theta_S - \theta) \le x, \hat\theta_S = 0)$,
$P_{n,\theta}(n^{1/2}(\hat\theta_S - \theta) \le x, \hat\theta_S > 0)$, and $P_{n,\theta}(n^{1/2}(\hat\theta_S - \theta) \le x, \hat\theta_S < 0)$. Similar to
before, the three terms in (5) correspond to the distributions of $n^{1/2}(\hat\theta_S - \theta)$
conditional on the events $\{\hat\theta_S = 0\}$, $\{\hat\theta_S > 0\}$, and $\{\hat\theta_S < 0\}$, respectively,
multiplied by the respective probabilities of these events. The distribution in
(5) is again a mixture of a singular normal distribution and of an absolutely
continuous part, which is now the sum of two normal densities, each with a
truncated tail. Figure 2 exemplifies a typical shape of this distribution.



[Figure 2: panel "Soft-Thresholding".]

Figure 2: Distribution of $n^{1/2}(\hat\theta_S - \theta)$. The choice of constants and
the interpretation of the image are the same as in Figure 1.
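
The closed-form cdf of the soft-thresholding estimator lends itself to a quick sanity check against simulation. The sketch below is illustrative: the function names, the number of draws, and the seed are arbitrary choices, and the simulation draws the sample mean directly from its $N(\theta, 1/n)$ distribution rather than generating individual observations.

```python
import random
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cdf


def cdf_soft(x: float, n: int, theta: float, eta: float) -> float:
    """Closed-form cdf of sqrt(n) * (soft-thresholding estimate - theta)."""
    r = sqrt(n)
    if x >= -r * theta:
        return PHI(x + r * eta)
    return PHI(x - r * eta)


def simulate_cdf_soft(x, n, theta, eta, draws=50_000, seed=0):
    """Monte Carlo estimate of the same cdf value."""
    rng = random.Random(seed)
    r = sqrt(n)
    hits = 0
    for _ in range(draws):
        ybar = rng.gauss(theta, 1.0 / r)           # sample mean of n N(theta, 1) draws
        shrunk = max(abs(ybar) - eta, 0.0)
        estimate = shrunk if ybar > 0 else -shrunk  # soft-thresholding
        if r * (estimate - theta) <= x:
            hits += 1
    return hits / draws


if __name__ == "__main__":
    n, theta, eta = 40, 0.16, 0.05
    for x in (-2.0, -1.0, 0.0, 1.0):
        print(x, round(cdf_soft(x, n, theta, eta), 4), round(simulate_cdf_soft(x, n, theta, eta), 4))
```

The jump of the cdf at $-n^{1/2}\theta$ equals the point-mass weight, which is the same for soft- and hard-thresholding.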

The finite-sample distribution of the SCAD-estimator is obtained in a similar
vein: Decomposing the probability $P_{n,\theta}(n^{1/2}(\hat\theta_{SCAD} - \theta) \le x)$ into a sum of
seven terms by decomposing the relevant event into its intersection with the
events $\{|\bar{y}| \le \eta_n\}$, $\{\eta_n < \bar{y} \le 2\eta_n\}$, $\{2\eta_n < \bar{y} \le a\eta_n\}$, $\{a\eta_n < \bar{y}\}$,
$\{-2\eta_n \le \bar{y} < -\eta_n\}$, $\{-a\eta_n \le \bar{y} < -2\eta_n\}$, and $\{\bar{y} < -a\eta_n\}$, shows that the distribution
$F_{SCAD,n,\theta}$ of $n^{1/2}(\hat\theta_{SCAD} - \theta)$ is of the form

$$\begin{aligned}
dF_{SCAD,n,\theta}(x) = {} & \{\Phi(n^{1/2}(-\theta + \eta_n)) - \Phi(n^{1/2}(-\theta - \eta_n))\}\, d\delta_{-n^{1/2}\theta}(x) \\
& + \{f_1(x) + f_2(x) + f_3(x) + f_{-1}(x) + f_{-2}(x) + f_{-3}(x)\}\, dx,
\end{aligned} \tag{7}$$

where

$$\begin{aligned}
f_1(x) &= \phi(x + n^{1/2}\eta_n)\, 1(0 < x + n^{1/2}\theta \le n^{1/2}\eta_n), \\
f_2(x) &= \frac{a - 2}{a - 1}\, \phi\left(\{(a - 2)x - n^{1/2}\theta + a n^{1/2}\eta_n\}/(a - 1)\right) \times 1(n^{1/2}\eta_n < x + n^{1/2}\theta \le a n^{1/2}\eta_n), \\
f_3(x) &= \phi(x)\, 1(x + n^{1/2}\theta > n^{1/2} a \eta_n),
\end{aligned}$$

and where $f_{-1}(x)$, $f_{-2}(x)$, and $f_{-3}(x)$ are defined as $f_1(x)$, $f_2(x)$, and $f_3(x)$,
respectively, but with $-x$ replacing $x$ and with $-\theta$ replacing $\theta$ in the formulae.
As in the case of the other estimators, the distribution of the SCAD-estimator
is a mixture of a singular normal distribution and an absolutely continuous
part, the latter being more complicated here as it is the sum of six pieces,
each obtained from normal distributions by truncation or excision. As shown
in Figure 3, the absolutely continuous part of $F_{SCAD,n,\theta}$ can be multimodal.



[Figure 3: panel "SCAD"; horizontal axis $x$ from $-3$ to $3$.]

Figure 3: Distribution of $n^{1/2}(\hat\theta_{SCAD} - \theta)$. The tuning-parameter $a$
is chosen as $a = 3.7$ here, cf. Fan and Li (2001); the choice of the
other constants and the interpretation of the image are the same as in
Figure 1. The graph for the SCAD estimator coincides with that for
the soft-thresholding estimator inside a neighborhood of the location
of the atomic part at $-n^{1/2}\theta$ (vertical dotted line), and with that for
the hard-thresholding estimator outside of a (larger) neighborhood
of $-n^{1/2}\theta$. The area between these two regions corresponds to the
dips shown in the figure.
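
As a plausibility check on the six-piece density, one can integrate it numerically and add the point-mass weight. The sketch below rests on an assumption that should be stated plainly: the middle piece is written with the Jacobian factor $(a-2)/(a-1)$ that arises from the change of variables in the linear-interpolation region, which is what a direct derivation from the definition of the SCAD estimator gives. Function names and the integration rule are illustrative.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal: STD.cdf and STD.pdf


def scad_pieces(x: float, rn_theta: float, rn_eta: float, a: float) -> float:
    """f_1 + f_2 + f_3, with rn_theta = sqrt(n)*theta and rn_eta = sqrt(n)*eta_n."""
    u = x + rn_theta                    # equals sqrt(n) times the estimate
    if 0.0 < u <= rn_eta:               # soft-thresholding region
        return STD.pdf(x + rn_eta)
    if rn_eta < u <= a * rn_eta:        # linear interpolation region
        z = ((a - 2.0) * x - rn_theta + a * rn_eta) / (a - 1.0)
        return (a - 2.0) / (a - 1.0) * STD.pdf(z)   # assumed Jacobian factor
    if u > a * rn_eta:                  # region where the estimate equals ybar
        return STD.pdf(x)
    return 0.0


def ac_density_scad(x: float, n: int, theta: float, eta: float, a: float = 3.7) -> float:
    """Sum of all six pieces; the negative-side pieces use -x and -theta."""
    r = sqrt(n)
    return scad_pieces(x, r * theta, r * eta, a) + scad_pieces(-x, -r * theta, r * eta, a)


def atom_weight(n: int, theta: float, eta: float) -> float:
    r = sqrt(n)
    return STD.cdf(r * (-theta + eta)) - STD.cdf(r * (-theta - eta))


def midpoint_integral(f, lo: float, hi: float, steps: int) -> float:
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))


if __name__ == "__main__":
    n, theta, eta = 40, 0.16, 0.05  # constants from the Figure 1 caption; a = 3.7
    mass = midpoint_integral(lambda x: ac_density_scad(x, n, theta, eta), -10.0, 10.0, 40_000)
    print("total mass:", round(atom_weight(n, theta, eta) + mass, 4))
```

Without the factor in the middle piece the total would exceed one, which is a simple way to see that it is needed.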
In summary, we see that the finite-sample distributions of the estimators $\hat\theta_H$,
$\hat\theta_S$, and $\hat\theta_{SCAD}$ are typically highly non-normal and can be multimodal. As a
point of interest, we also note that the computations leading to the above formulae
also deliver the conditional finite-sample distributions of the estimators
$\hat\theta_H$, $\hat\theta_S$, and $\hat\theta_{SCAD}$, respectively, conditional on selecting model $M_R$ or $M_U$.
In particular, we note that the conditional distribution of each of these estimators,
conditional on having selected the restricted model $M_R$, coincides with
the distribution of the restricted maximum likelihood estimator $\hat\theta_R$; in contrast,
conditional on selecting the unrestricted model $M_U$, the conditional distribution
is not identical to the distribution of the unrestricted maximum likelihood
estimator $\hat\theta_U$, but is more complicated. This phenomenon applies also to large
classes of post-model-selection estimators; see Pötscher (1991) and Leeb and
Pötscher (2003) for more discussion.

5.2 Asymptotic distributions 

We next obtain the asymptotic distributions of $\hat\theta_H$, $\hat\theta_S$, and $\hat\theta_{SCAD}$. We present
the asymptotic distributional results under general 'moving parameter' asymptotics,
where the true parameter $\theta_n$ can depend on sample size, because considering
only fixed-parameter asymptotics may paint a quite misleading picture of
the behavior of the estimators (cf. Leeb and Pötscher (2003, 2005)). In fact,
the results given below amount to a complete description of all possible accumulation
points of the finite-sample distributions of the estimators in question,
cf. Remarks 8 and 12. Not surprisingly, the results in the conservative model
selection case are different from the ones in the consistent model selection case.

5.2.1 Conservative case 

Here we characterize the large-sample behavior of the distributions of $\hat\theta_H$, $\hat\theta_S$,
and $\hat\theta_{SCAD}$ for the case where these estimators are tuned to perform conservative
model selection.

Theorem 4 Consider the hard-thresholding estimator with $\eta_n \to 0$ and
$n^{1/2}\eta_n \to e$, $0 \le e < \infty$. Suppose the true parameter $\theta_n \in \mathbb{R}$ satisfies
$n^{1/2}\theta_n \to \nu \in \mathbb{R} \cup \{-\infty, \infty\}$. Then $F_{H,n,\theta_n}$ converges weakly to the distribution
given by

$$\{\Phi(-\nu + e) - \Phi(-\nu - e)\}\, d\delta_{-\nu}(x) + \phi(x)\, 1(|x + \nu| > e)\, dx. \tag{8}$$

[Note that (8) reduces to a standard normal distribution in case $|\nu| = \infty$ or
$e = 0$.]

Proof.$^5$ Recall that the finite-sample distribution is given in (4). Convergence
of the weights $\Phi(n^{1/2}(-\theta_n + \eta_n)) - \Phi(n^{1/2}(-\theta_n - \eta_n))$ to $\Phi(-\nu + e) - \Phi(-\nu - e)$
is obvious (cf. proof of Proposition 1). Hence, the atomic part of $F_{H,n,\theta_n}$
converges weakly to the atomic part of (8) if $|\nu| < \infty$ and $e > 0$; if $|\nu| = \infty$
or if $e = 0$, the total mass of the atomic part converges to zero. The density
of the absolutely continuous part of $F_{H,n,\theta_n}$ is easily seen to converge Lebesgue
almost everywhere (in fact everywhere on $\mathbb{R}$ except possibly at $x = -\nu \pm e$)
to the density of the absolutely continuous part of (8). Also the total mass
of the absolutely continuous part is seen to converge to the total mass of the
absolutely continuous part of (8). By an application of Scheffé's Lemma, the
densities converge in absolute mean, and hence the absolutely continuous part
converges in the total variation sense. ■

5 Theorem 4 is essentially a special case of results obtained in Leeb and Pötscher (2003) for
a more general class of post-model-selection estimators. The proof of this result is included
here because of its brevity and illustrative value.

The fixed-parameter asymptotic distribution is obtained from Theorem 4 by
letting $\theta_n = \theta$: If $\theta = 0$, the pointwise asymptotic distribution of the
hard-thresholding estimator is seen to be

\[
\{\Phi(e) - \Phi(-e)\}\, d\delta_0(x) + \phi(x)\,\mathbf{1}(|x| > e)\, dx,
\]

which coincides with the finite-sample distribution (4) in this case except for
replacing $n^{1/2}\eta_n$ by its limiting value $e$. However, if $\theta \neq 0$, the pointwise
asymptotic distribution is always standard normal, which clearly misrepresents the
actual distribution (4). This disagreement is most pronounced in the statistically
interesting case where $\theta$ is close to, but not equal to, zero (e.g., $\theta \sim n^{-1/2}$).
In contrast, the distribution (8) much better captures the behavior of the
finite-sample distribution also in this case because (8) coincides with (4) except for
the fact that $n^{1/2}\eta_n$ and $n^{1/2}\theta_n$ have settled down to their limiting values.
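
The size of this disagreement can be checked numerically. The following Python sketch evaluates the cdf implied by (8) and sets it against the standard normal cdf. The function names `Phi` and `hard_threshold_cdf` and the values of $\nu$ and $e$ are illustrative assumptions, not taken from the paper. Because (4) has the same form as (8) with $n^{1/2}\theta$ and $n^{1/2}\eta_n$ in place of $\nu$ and $e$, the same function also evaluates the finite-sample cdf.

```python
import math


def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def hard_threshold_cdf(x, nu, e):
    """Cdf at x of the distribution in (8), for finite nu and e >= 0.

    nu stands for n^{1/2} * theta_n (or its limit),
    e stands for n^{1/2} * eta_n (or its limit).
    """
    lo, hi = -nu - e, -nu + e  # endpoints of the excision interval
    # Absolutely continuous part: N(0,1) mass on (-inf, x] outside [lo, hi].
    excised = max(0.0, Phi(min(x, hi)) - Phi(lo))
    ac_part = Phi(x) - excised
    # Atomic part: point mass at -nu, counted once x >= -nu.
    atom = (Phi(hi) - Phi(lo)) if x >= -nu else 0.0
    return ac_part + atom


if __name__ == "__main__":
    nu, e = 1.0, 2.0  # hypothetical: theta = nu / sqrt(n), eta_n = e / sqrt(n)
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(x, round(hard_threshold_cdf(x, nu, e), 4), round(Phi(x), 4))
```

With $\nu = 1$ and $e = 2$, the cdf at $x = 0$ equals $\Phi(1) \approx 0.84$, whereas the pointwise limit $N(0,1)$ assigns the value $0.5$.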

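Theorem 5 Consider the soft-thresholding estimator with $\eta_n \to 0$ and
$n^{1/2}\eta_n \to e$, $0 \le e < \infty$. Suppose the true parameter $\theta_n \in \mathbb{R}$ satisfies
$n^{1/2}\theta_n \to \nu \in \mathbb{R} \cup \{-\infty, \infty\}$. Then $F_{S,n,\theta_n}$ converges weakly to the
distribution given by

\[
\{\Phi(-\nu+e) - \Phi(-\nu-e)\}\, d\delta_{-\nu}(x)
+ \{\phi(x+e)\,\mathbf{1}(x > -\nu) + \phi(x-e)\,\mathbf{1}(x < -\nu)\}\, dx. \tag{9}
\]

[Note that (9) reduces to a $N(-\mathrm{sign}(\nu)e, 1)$-distribution in case $|\nu| = \infty$ or
$e = 0$.]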

The proof is completely analogous to the proof of Theorem 4. Since
soft-thresholding arises as a special case of the LASSO-estimator, the above result
is closely related to the results in Knight and Fu (2000).⁶ Similar to the case
of hard-thresholding, a fixed-parameter asymptotic analysis only partially
reflects the finite-sample behavior of the estimator: In case $\theta = 0$, the pointwise
asymptotic distribution is

\[
\{\Phi(e) - \Phi(-e)\}\, d\delta_0(x) + \{\phi(x-e)\,\mathbf{1}(x < 0) + \phi(x+e)\,\mathbf{1}(x > 0)\}\, dx.
\]

However, if $\theta \neq 0$, the pointwise limit distribution is $N(-\mathrm{sign}(\theta)e, 1)$, which
is not in good agreement with the finite-sample distribution (5), especially in
the statistically interesting case where $\theta$ is close to, but not equal to, zero (e.g.,
$\theta \sim n^{-1/2}$). In contrast, (9) is in better agreement with (5) also in this case
in the sense that (9) coincides with (5) except that $n^{1/2}\eta_n$ and $n^{1/2}\theta_n$ have
settled down to their limiting values.

6 Since Knight and Fu (2000) consider the LASSO-estimator in a linear regression model
without an intercept, their results do not directly apply to the model considered here.
However, their results can easily be modified to also cover linear regression with an intercept.
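
The agreement between (9) and the finite-sample behavior can be checked by simulation. The sketch below assumes, in line with the weights appearing in the finite-sample formulas, that the unrestricted estimator $\bar y$ is distributed as $N(\theta, 1/n)$. It also uses the fact that integrating (9) gives the cdf $\Phi(x+e)$ for $x \ge -\nu$ and $\Phi(x-e)$ for $x < -\nu$. All function names, the sample size, and the parameter values are illustrative choices.

```python
import math
import random


def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def soft_threshold(y, eta):
    """Soft-thresholding: sign(y) * max(|y| - eta, 0)."""
    return math.copysign(max(abs(y) - eta, 0.0), y)


def soft_limit_cdf(x, nu, e):
    """Cdf of (9): Phi(x + e) for x >= -nu, Phi(x - e) for x < -nu."""
    return Phi(x + e) if x >= -nu else Phi(x - e)


def simulate_cdf(x, n, theta, eta, reps=100_000, seed=0):
    """Monte Carlo estimate of P(n^{1/2} (theta_hat_S - theta) <= x)."""
    rng = random.Random(seed)
    root_n = math.sqrt(n)
    hits = 0
    for _ in range(reps):
        ybar = theta + rng.gauss(0.0, 1.0) / root_n  # assumed: ybar ~ N(theta, 1/n)
        if root_n * (soft_threshold(ybar, eta) - theta) <= x:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    n, theta, eta = 400, 0.05, 0.075  # so n^{1/2} theta = 1 and n^{1/2} eta = 1.5
    for x in (-2.0, -0.5, 0.5):
        print(x, simulate_cdf(x, n, theta, eta), soft_limit_cdf(x, 1.0, 1.5))
```

For these values the simulated and the exact cdf should agree up to Monte Carlo error, while the pointwise limit $N(-e, 1)$ places no atom at $-\nu$.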



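Theorem 6 Consider the SCAD estimator with $\eta_n \to 0$ and $n^{1/2}\eta_n \to e$,
$0 \le e < \infty$. Suppose the true parameter $\theta_n \in \mathbb{R}$ satisfies
$n^{1/2}\theta_n \to \nu \in \mathbb{R} \cup \{-\infty, \infty\}$. Then $F_{SCAD,n,\theta_n}$ converges weakly to the
distribution given by

\[
\begin{aligned}
&\{\Phi(-\nu+e) - \Phi(-\nu-e)\}\, d\delta_{-\nu}(x) \\
&\quad + \Big\{ \phi(x+e)\,\mathbf{1}(0 < x+\nu \le e) + \phi(x-e)\,\mathbf{1}(-e \le x+\nu < 0) \\
&\qquad + \tfrac{a-2}{a-1}\,\phi\big(\{(a-2)x - \nu + ae\}/(a-1)\big)\,\mathbf{1}(e < x+\nu \le ae) \\
&\qquad + \tfrac{a-2}{a-1}\,\phi\big(\{(a-2)x - \nu - ae\}/(a-1)\big)\,\mathbf{1}(-ae \le x+\nu < -e) \\
&\qquad + \phi(x)\,\mathbf{1}(|x+\nu| > ae) \Big\}\, dx.
\end{aligned}
\tag{10}
\]

[Note that (10) reduces to a standard normal distribution in case $|\nu| = \infty$ or
$e = 0$.]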

The proof of Theorem 6 is again completely analogous to that of Theorem 4.
As with the hard- and soft-thresholding estimators discussed before, a
fixed-parameter asymptotic analysis of the SCAD estimator only partially reflects its
finite-sample behavior: In case $\theta = 0$, the pointwise asymptotic distribution is
given by (10) with $\nu = 0$, but in case $\theta \neq 0$ it is given by $N(0, 1)$, which is
definitely not in good agreement with the finite-sample distribution (7), especially
in the statistically interesting case where $\theta$ is different from, but close to, zero,
e.g., $\theta \sim n^{-1/2}$. In contrast, (10) is in much better agreement with (7) in view
of the fact that (10) coincides with (7) except that $n^{1/2}\eta_n$ and $n^{1/2}\theta_n$ have
settled down to their limiting values.
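
As a sanity check on the piecewise form of (10), the following sketch evaluates the density of its absolutely continuous part and verifies numerically that atom and density together carry total mass one. The function names, the midpoint integration rule, and the parameter values are illustrative assumptions; $a = 3.7$ is used only as an example of an admissible value $a > 2$.

```python
import math


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def scad_limit_density(x, nu, e, a):
    """Density of the absolutely continuous part of (10) at x (finite nu, a > 2)."""
    s = x + nu
    if 0.0 < s <= e:
        return phi(x + e)
    if -e <= s < 0.0:
        return phi(x - e)
    if e < s <= a * e:
        return (a - 2.0) / (a - 1.0) * phi(((a - 2.0) * x - nu + a * e) / (a - 1.0))
    if -a * e <= s < -e:
        return (a - 2.0) / (a - 1.0) * phi(((a - 2.0) * x - nu - a * e) / (a - 1.0))
    if abs(s) > a * e:
        return phi(x)
    return 0.0  # s == 0 is a single point; its mass is carried by the atom


def scad_limit_atom(nu, e):
    """Weight of the point mass at -nu in (10)."""
    return Phi(-nu + e) - Phi(-nu - e)


def total_mass(nu, e, a, lo=-12.0, hi=12.0, steps=200_000):
    """Atom plus a midpoint-rule integral of the density over [lo, hi]."""
    h = (hi - lo) / steps
    integral = h * sum(
        scad_limit_density(lo + (i + 0.5) * h, nu, e, a) for i in range(steps)
    )
    return scad_limit_atom(nu, e) + integral


if __name__ == "__main__":
    print(total_mass(nu=0.5, e=1.0, a=3.7))
```

Setting `e = 0` removes the atom and all intermediate pieces, so that only the standard normal density remains, in line with the note following (10).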

We note that the mathematical reason for the failure of the pointwise
asymptotic distribution to capture the behavior of the finite-sample distribution well is
that the convergence of the latter to the former is not uniform in the underlying
parameter $\theta$. See Leeb and Pötscher (2003, 2005) for more discussion in the
context of post-model-selection estimators.

Remark 7 If $|\nu| = \infty$, or $e = 0$, or $n^{1/2}\theta_n = \nu$ does not depend on $n$, the
convergence in the above three theorems is even in the total variation distance.
In the first two cases this follows because the total mass of the atomic part
converges to zero; in the third case it follows because the location of the pointmass
is independent of $n$.

Remark 8 The above theorems actually completely describe the limiting
behavior of the finite-sample distributions of $\hat\theta_H$, $\hat\theta_S$, and $\hat\theta_{SCAD}$ without any
condition on the sequence of parameters $\theta_n$. To see this, just apply the
theorems to subsequences and note that by compactness of $\mathbb{R} \cup \{-\infty, \infty\}$ we can
select from every subsequence a further subsequence such that $n^{1/2}\theta_n$ converges
in $\mathbb{R} \cup \{-\infty, \infty\}$ along this further subsequence.

5.2.2 Consistent case 

In the case where the estimators $\hat\theta_H$, $\hat\theta_S$, and $\hat\theta_{SCAD}$ are tuned to perform
consistent model selection (i.e., $\eta_n \to 0$ and $n^{1/2}\eta_n \to \infty$), the fixed-parameter
limiting behavior of the finite-sample distributions is particularly simple: The
finite-sample distribution of the hard-thresholding estimator converges to the
$N(0, 0)$-distribution (i.e., to pointmass at 0) if $\theta = 0$, and to the
$N(0, 1)$-distribution if $\theta \neq 0$; cf. Lemma 1 in Pötscher (1991). In other words, the
pointwise asymptotic distribution of $n^{1/2}(\hat\theta_H - \theta)$ coincides with the asymptotic
distribution of the restricted maximum likelihood estimator if $\theta = 0$, and
coincides with the asymptotic distribution of the unrestricted maximum likelihood
estimator if $\theta \neq 0$. The hard-thresholding estimator, when tuned in this way,
therefore satisfies what has sometimes been dubbed the 'oracle' property in the
literature.⁷ The SCAD-estimator with the same tuning is also known to possess
the 'oracle' property; cf. Fan and Li (2001). With the same tuning, the
soft-thresholding estimator has a somewhat different pointwise asymptotic behavior,
which is discussed later.

The 'oracle' property of the hard-thresholding estimator and the
SCAD-estimator implies in particular that both estimators are $n^{1/2}$-consistent. In
Theorem 2, however, we have (in contrast to the conservative model selection
case) only been able to establish uniform $\eta_n^{-1}$-consistency and not uniform
$n^{1/2}$-consistency. This prompts the question whether Theorem 2 is just not sharp
enough or whether the estimators actually are not uniformly $n^{1/2}$-consistent. It
furthermore raises the question of the behavior of the finite-sample distributions
of $n^{1/2}(\hat\theta_H - \theta)$, $n^{1/2}(\hat\theta_S - \theta)$, and $n^{1/2}(\hat\theta_{SCAD} - \theta)$ in a 'uniform' asymptotic
framework. The three results that follow answer this by determining the limits
of the finite-sample distributions of $\hat\theta_H$, $\hat\theta_S$, and $\hat\theta_{SCAD}$ under general 'moving
parameter' asymptotics when the estimators are tuned to perform consistent
model selection.

Theorem 9 Consider the hard-thresholding estimator with $\eta_n \to 0$ and
$n^{1/2}\eta_n \to \infty$. Assume that $\theta_n/\eta_n \to \zeta$ for some $\zeta \in \mathbb{R} \cup \{-\infty, \infty\}$ and that
$n^{1/2}\theta_n \to \nu$ for some $\nu \in \mathbb{R} \cup \{-\infty, \infty\}$. [Note that in case $\zeta \neq 0$ the
convergence of $n^{1/2}\theta_n$ already follows from that of $\theta_n/\eta_n$, and $\nu$ is then given by
$\nu = \mathrm{sign}(\zeta)\infty$.]

1. If $|\zeta| < 1$, then $F_{H,n,\theta_n}$ approaches pointmass at $-\nu$. In case $|\nu| < \infty$, this
means that $F_{H,n,\theta_n}$ converges weakly to pointmass at $-\nu$; in case $|\nu| = \infty$,
this means that the total mass of $F_{H,n,\theta_n}$ escapes to $-\nu$, in the sense that
$F_{H,n,\theta_n}(x) \to 0$ for every $x \in \mathbb{R}$ if $-\nu = \infty$, and $F_{H,n,\theta_n}(x) \to 1$ for every
$x \in \mathbb{R}$ if $-\nu = -\infty$.

2. If $|\zeta| = 1$ and $n^{1/2}(\eta_n - \zeta\theta_n) \to r$ for some $r \in \mathbb{R} \cup \{-\infty, \infty\}$, then
$F_{H,n,\theta_n}(x)$ converges to

\[
\Phi(r)\,\mathbf{1}(\zeta = 1) + \int_{-\infty}^{x} \phi(u)\,\mathbf{1}(\zeta u > r)\, du
\]

for every $x \in \mathbb{R}$. This limit corresponds to pointmass at $-\nu = \mathrm{sign}(-\zeta)\infty$
if $r = \infty$, and otherwise represents a convex combination of pointmass
at $-\nu = \mathrm{sign}(-\zeta)\infty$ and an absolutely continuous distribution whose
density, a kind of truncated standard normal, is given by $(1 - \Phi(r))^{-1}$ times
the integrand in the above formula; the weights in that convex
combination are given by $\Phi(r)$ and $(1 - \Phi(r))$, respectively. [The weight of the
absolutely continuous component equals one in case $r = -\infty$; in this case,
convergence is in fact in total variation distance.]

3. If $1 < |\zeta| \le \infty$, then $F_{H,n,\theta_n}$ converges weakly to $\Phi$, the standard normal
cdf. [In fact, convergence is in total variation distance.]

7 This does not come as a surprise, since post-model-selection estimators based on a
consistent model selection procedure in general satisfy the 'oracle' property as already noted in
Lemma 1 of Pötscher (1991); but see also the warning issued in the discussion following that
lemma.

Proof. Proposition 1 shows that the total mass of the atomic part of $F_{H,n,\theta_n}$
converges to one under the conditions of part 1. Because the atomic part is
located at $-n^{1/2}\theta_n$ in view of (4), part 1 follows immediately.

For part 2, assume first that $\zeta = 1$. Proposition 1 shows that the total mass
of the atomic part of $F_{H,n,\theta_n}$ converges to $\Phi(r)$. Furthermore, $n^{1/2}\theta_n \to \infty$
certainly holds, which implies that the atomic part escapes to $-\infty$. If $r = \infty$, we
are hence done. Suppose now that $r < \infty$. In (4), the boundaries of the 'excision
interval' of the absolutely continuous part of $F_{H,n,\theta_n}$, i.e.,
$-n^{1/2}(\eta_n + \theta_n) = -n^{1/2}\eta_n(1 + \theta_n/\eta_n)$ and $n^{1/2}(\eta_n - \theta_n)$, then converge to
$-\infty$ and $r$, respectively. This shows that

\[
\phi(x)\,\mathbf{1}(|x + n^{1/2}\theta_n| > n^{1/2}\eta_n) \to \phi(x)\,\mathbf{1}(x > r)
\]

for Lebesgue almost every $x \in \mathbb{R}$. The Dominated Convergence Theorem then
shows that the convergence in the above display also holds in absolute mean.
This completes the proof of part 2 in case $\zeta = 1$. The case where $\zeta = -1$ is
treated similarly.

Under the conditions of part 3, Proposition 1 shows that the total mass of
the absolutely continuous part converges to one. Furthermore, the boundaries
of the 'excision interval' in (4), i.e., $-n^{1/2}(\eta_n + \theta_n) = -n^{1/2}\eta_n(1 + \theta_n/\eta_n)$ and
$n^{1/2}(\eta_n - \theta_n) = n^{1/2}\eta_n(1 - \theta_n/\eta_n)$, diverge either both to $\infty$ or both to $-\infty$,
because $|\zeta| > 1$. This implies that

\[
\phi(x)\,\mathbf{1}(|x + n^{1/2}\theta_n| > n^{1/2}\eta_n) \to \phi(x)
\]

for every $x \in \mathbb{R}$. Together with the Dominated Convergence Theorem this
completes the proof. ■

The fixed-parameter asymptotic behavior of the hard-thresholding estimator
discussed earlier, including the 'oracle' property, can clearly be recovered
from the above theorem by setting $\theta_n = \theta$. However, the theorem shows that
the asymptotic behavior of the hard-thresholding estimator is more complicated
than what the 'oracle' property predicts. In particular, the theorem shows that
the hard-thresholding estimator is not uniformly $n^{1/2}$-consistent as the sequence
of finite-sample distributions is not stochastically bounded in all cases. [In that
sense scaling by $n^{1/2}$ does not appear to be the natural thing to do; see the
discussion below as well as the appendix.] Furthermore, as shown by (4), the
finite-sample distribution is highly non-normal, whereas the pointwise
asymptotic distribution is always normal and thus cannot capture essential features
of the finite-sample distribution. In contrast, the asymptotic distribution given
in Theorem 9 is also non-normal in some cases. All this goes to show that
the 'oracle' property, which is based on the pointwise asymptotic distribution
only, paints a highly misleading picture of the behavior of the hard-thresholding
estimator and should not be taken at face value.⁸ A result for a certain
post-model-selection estimator that is related to Theorem 9 above can be found in
Appendix A of Leeb and Pötscher (2005).
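
The role of $\zeta = \lim \theta_n/\eta_n$ in Theorem 9 can be made concrete by computing the weight of the atomic part of (4) along the sequences $\theta_n = \zeta\eta_n$. The sketch below is illustrative only: the tuning sequence $\eta_n = n^{-1/4}$ is one hypothetical choice satisfying $\eta_n \to 0$ and $n^{1/2}\eta_n \to \infty$, and the function names are not from the paper.

```python
import math


def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_zero(n, theta, eta):
    """Weight of the atom in (4): Phi(n^{1/2}(-theta + eta)) - Phi(n^{1/2}(-theta - eta))."""
    root_n = math.sqrt(n)
    return Phi(root_n * (eta - theta)) - Phi(root_n * (-eta - theta))


def eta_consistent(n):
    """Hypothetical consistent tuning: eta_n = n^{-1/4}."""
    return n ** -0.25


if __name__ == "__main__":
    for n in (10**2, 10**4, 10**6, 10**8):
        eta = eta_consistent(n)
        weights = [prob_zero(n, zeta * eta, eta) for zeta in (0.5, 1.0, 2.0)]
        # Location of the atom for zeta = 0.5 is -n^{1/2} * theta_n.
        atom_location = -math.sqrt(n) * 0.5 * eta
        print(n, [round(w, 4) for w in weights], atom_location)
```

For $\zeta = 1/2$ the weight tends to one while the atom sits at $-n^{1/2}\theta_n = -n^{1/4}/2$, which diverges; this is the escaping mass of part 1. For $\zeta = 1$ with $r = 0$ the weight equals $\Phi(0) = 1/2$ in the limit (part 2), and for $\zeta = 2$ it vanishes (part 3).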

Theorem 10 Consider the soft-thresholding estimator with $\eta_n \to 0$ and
$n^{1/2}\eta_n \to \infty$. Assume that $n^{1/2}\theta_n \to \nu \in \mathbb{R} \cup \{-\infty, \infty\}$. Then $F_{S,n,\theta_n}$
approaches pointmass at $-\nu$. In case $|\nu| < \infty$, this means that $F_{S,n,\theta_n}$ converges
weakly to pointmass at $-\nu$; in case $|\nu| = \infty$, it means that the total mass of
$F_{S,n,\theta_n}$ escapes to $-\nu$, in the sense that $F_{S,n,\theta_n}(x) \to 0$ for every $x \in \mathbb{R}$ if
$-\nu = \infty$, and $F_{S,n,\theta_n}(x) \to 1$ for every $x \in \mathbb{R}$ if $-\nu = -\infty$.

Proof. From (5) we have that $F_{S,n,\theta_n}(x) = \Phi(x + n^{1/2}\eta_n)$ for $x \ge -n^{1/2}\theta_n$ and
$F_{S,n,\theta_n}(x) = \Phi(x - n^{1/2}\eta_n)$ for $x < -n^{1/2}\theta_n$. Because $n^{1/2}\eta_n \to \infty$, this entails
that $F_{S,n,\theta_n}(x)$ converges to one for each $x > -\nu$ and to zero for each $x < -\nu$.
■

The fixed-parameter asymptotic distribution of the soft-thresholding
estimator is obtained by setting $\theta_n = \theta$ in the above theorem: It is $N(0, 0)$ (i.e.,
pointmass at 0) if $\theta = 0$; if $\theta \neq 0$ the total mass of the finite-sample distribution
escapes to $\mathrm{sign}(-\theta)\infty$. Hence, the soft-thresholding estimator when tuned to
act as a consistent model selector is not even pointwise $n^{1/2}$-consistent (Zou
(2006)) and certainly does not satisfy the 'oracle' property. [This contradicts
an incorrect claim in Zhao and Yu (2006, Section 2.1) to the effect that tuning
LASSO to act as a consistent model selector results in an asymptotically
normal estimator.] The fact that this estimator is not pointwise $n^{1/2}$-consistent
also suggests studying the asymptotic distribution under a scaling that increases
slower than $n^{1/2}$, an issue that we take up further below; cf. also the appendix.
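
The escape of mass for fixed $\theta \neq 0$ is easy to see numerically from the closed form of the cdf used in the proof of Theorem 10. In the sketch below the tuning sequence $\eta_n = n^{-1/4}$ and the value $\theta = 0.5$ are hypothetical choices made for illustration, and the function names are not from the paper.

```python
import math


def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def soft_cdf(x, n, theta, eta):
    """Cdf of n^{1/2}(theta_hat_S - theta) at x, as in the proof of Theorem 10."""
    root_n = math.sqrt(n)
    if x >= -root_n * theta:
        return Phi(x + root_n * eta)
    return Phi(x - root_n * eta)


if __name__ == "__main__":
    theta, x = 0.5, -5.0  # fixed nonzero parameter, fixed evaluation point
    for n in (10**2, 10**4, 10**6):
        eta = n ** -0.25  # hypothetical consistent tuning
        print(n, soft_cdf(x, n, theta, eta))
```

For $\theta > 0$ the cdf at any fixed $x$ tends to one, so the mass escapes to $-\infty$; the cause is the shift $-n^{1/2}\eta_n$ that soft-thresholding applies to every nonzero estimate.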

Theorem 11 Consider the SCAD estimator with $\eta_n \to 0$ and $n^{1/2}\eta_n \to \infty$.
Assume that $\theta_n/\eta_n \to \zeta$ for some $\zeta \in \mathbb{R} \cup \{-\infty, \infty\}$ and that $n^{1/2}\theta_n \to \nu$ for
some $\nu \in \mathbb{R} \cup \{-\infty, \infty\}$. [Note that in case $\zeta \neq 0$ the convergence of $n^{1/2}\theta_n$
already follows from that of $\theta_n/\eta_n$, and $\nu$ is then given by $\nu = \mathrm{sign}(\zeta)\infty$.]

1. If $|\zeta| < a$, or if $|\zeta| = a$ and $n^{1/2}(a\eta_n - \mathrm{sign}(\zeta)\theta_n) \to \infty$, then $F_{SCAD,n,\theta_n}$
approaches pointmass at $-\nu$. In case $|\nu| < \infty$, this means that $F_{SCAD,n,\theta_n}$
converges weakly to pointmass at $-\nu$; in case $|\nu| = \infty$, it means that the
total mass of $F_{SCAD,n,\theta_n}$ escapes to $-\nu$, in the sense that $F_{SCAD,n,\theta_n}(x) \to 0$
for every $x \in \mathbb{R}$ if $-\nu = \infty$, and $F_{SCAD,n,\theta_n}(x) \to 1$ for every $x \in \mathbb{R}$ if
$-\nu = -\infty$.

2. If $|\zeta| = a$ and $n^{1/2}(a\eta_n - \mathrm{sign}(\zeta)\theta_n) \to r$ for some $r \in \mathbb{R} \cup \{-\infty\}$, then
$F_{SCAD,n,\theta_n}(x)$ converges to

\[
\int_{-\infty}^{x} \Big[ \tfrac{a-2}{a-1}\,\phi\big(\{(a-2)u + \mathrm{sign}(\zeta)r\}/(a-1)\big)\,\mathbf{1}(\mathrm{sign}(\zeta)u \le r)
+ \phi(u)\,\mathbf{1}(\mathrm{sign}(\zeta)u > r) \Big]\, du
\]

for every $x \in \mathbb{R}$, with the convention that the integral over the first term
in the above expression is zero if $r = -\infty$. [In fact, convergence is in total
variation distance.]

3. If $a < |\zeta| \le \infty$, then $F_{SCAD,n,\theta_n}$ converges weakly to the standard normal
distribution $N(0, 1)$. [In fact, convergence is in total variation distance.]

8 This is of course not new and has been observed more than 50 years ago in the context
of Hodges' estimator. For more discussion of the problematic nature of the 'oracle' property
see Leeb and Pötscher (2008a).

Proof. For each $\theta$, the cdf $F_{SCAD,n,\theta}$ consists of contributions from the atomic
part and from the absolutely continuous part. The contribution of the absolutely
continuous part can be further broken down into the contributions from the
integrands $f_1$, $f_2$, $f_3$, $f_{-1}$, $f_{-2}$, and $f_{-3}$ in view of (7). We hence may write

\[
\begin{aligned}
F_{SCAD,n,\theta}(x) ={}& F_{0,n,\theta}(x) + F_{1,n,\theta}(x) + F_{2,n,\theta}(x) + F_{3,n,\theta}(x) \\
&+ F_{-1,n,\theta}(x) + F_{-2,n,\theta}(x) + F_{-3,n,\theta}(x),
\end{aligned}
\]

where $F_{0,n,\theta}$ denotes the contribution of the atomic part, and where the
remaining terms on the right-hand side denote the contributions corresponding to $f_1$,
$f_2$, $f_3$, $f_{-1}$, $f_{-2}$, and $f_{-3}$, respectively; e.g., $F_{1,n,\theta}(x) = \int_{-\infty}^{x} f_1(u)\,du$. Now
$F_{1,n,\theta_n}(x)$ can be written as

\[
F_{1,n,\theta_n}(x) = \int_{-\infty}^{x + n^{1/2}\eta_n} \phi(z)\,\mathbf{1}\big(n^{1/2}(\eta_n - \theta_n) < z \le n^{1/2}(2\eta_n - \theta_n)\big)\, dz.
\]

[Use the formula for $f_1(u)$ given after (7) with $\theta_n$ in place of $\theta$, and perform a
simple change of variables.] By a similar argument, we also have

\[
F_{2,n,\theta_n}(x) = \int_{-\infty}^{((a-2)x + n^{1/2}(a\eta_n - \theta_n))/(a-1)} \phi(z)\,\mathbf{1}\big(n^{1/2}(2\eta_n - \theta_n) < z \le n^{1/2}(a\eta_n - \theta_n)\big)\, dz,
\]

and $F_{3,n,\theta_n}(x) = \int_{-\infty}^{x} \phi(z)\,\mathbf{1}\big(z > n^{1/2}(a\eta_n - \theta_n)\big)\, dz$.

Assume first that $0 \le \zeta < a$. In the subcase $0 \le \zeta < 1$, Proposition 1 shows
that the total mass of the atomic part of $F_{SCAD,n,\theta_n}$ converges to one, and the
statement in part 1 then follows, since $n^{1/2}\theta_n \to \nu$. For the remaining subcases
to be considered observe that we have $n^{1/2}\theta_n \to \nu = \infty$ whenever $\zeta > 0$. For the
subcase $\zeta = 1$, assume for now also that $n^{1/2}(\eta_n - \theta_n) \to r \in \mathbb{R} \cup \{-\infty, \infty\}$. Then
the atomic part of $F_{SCAD,n,\theta_n}$ escapes to $-\nu = -\infty$, and the total mass of the
atomic part converges to $\Phi(r)$ by Proposition 1. In other words, $F_{0,n,\theta_n}(x) \to \Phi(r)$
for each $x \in \mathbb{R}$, where $F_{0,n,\theta_n}$ denotes the contribution from the atomic
part of $F_{SCAD,n,\theta_n}$. Moreover, from the preceding formula for $F_{1,n,\theta_n}(x)$, it is
evident that $F_{1,n,\theta_n}(x) \to \int_{r}^{\infty} \phi(z)\,dz = 1 - \Phi(r)$ holds for each $x \in \mathbb{R}$ (because
the upper limit in the integral diverges to $\infty$, because the lower limit in the
indicator is $n^{1/2}(\eta_n - \theta_n) \to r$, and because the upper limit is
$n^{1/2}(2\eta_n - \theta_n) \to \infty$). Hence, $F_{SCAD,n,\theta_n}(x) \ge F_{0,n,\theta_n}(x) + F_{1,n,\theta_n}(x) \to 1$ for each
$x \in \mathbb{R}$, as required. Because that limit does not depend on $r$, and because any
subsequence contains a further subsequence along which $n^{1/2}(\eta_n - \theta_n)$ converges to
some limit $r \in \mathbb{R} \cup \{-\infty, \infty\}$ (due to compactness of this space), the result follows
for the subcase $\zeta = 1$. In the subcase $1 < \zeta < 2$, it is easy to see that $F_{1,n,\theta_n}(x)$
converges to one for each $x \in \mathbb{R}$, whence $F_{SCAD,n,\theta_n}(x) \ge F_{1,n,\theta_n}(x) \to 1$ for
each $x \in \mathbb{R}$. In the subcase $\zeta = 2$, assume for now also that
$n^{1/2}(2\eta_n - \theta_n) \to r \in \mathbb{R} \cup \{-\infty, \infty\}$. We then see that $F_{1,n,\theta_n}(x) \to \Phi(r)$ and
$F_{2,n,\theta_n}(x) \to 1 - \Phi(r)$, whence $F_{SCAD,n,\theta_n}(x) \to 1$ for each $x \in \mathbb{R}$. Because this
limit does not depend on $r$ and $\mathbb{R} \cup \{-\infty, \infty\}$ is compact, a subsequence argument
as above shows that the statement follows also in this subcase. Finally, in the
subcases $2 < \zeta < a$ and $\zeta = a$ but $n^{1/2}(a\eta_n - \theta_n) \to \infty$, it suffices to note that
$F_{2,n,\theta_n}(x) \to 1$ for all $x \in \mathbb{R}$.

Assume next that $\zeta = a$ and that $n^{1/2}(a\eta_n - \theta_n) \to r \in \mathbb{R} \cup \{-\infty\}$. Note that
$n^{1/2}(2\eta_n - \theta_n) = n^{1/2}\eta_n(2 - \theta_n/\eta_n) \to -\infty$ holds because $\zeta = a > 2$. Using the
formula for $f_2(u)$ and $f_3(u)$ given after (7) with $u$ replacing $x$ and $\theta_n$ replacing $\theta$,
it is then easy to see that $f_2(u) + f_3(u)$ converges to the integrand in the display
given in part 2, for almost all $u$. Moreover, the total mass of $F_{2,n,\theta_n} + F_{3,n,\theta_n}$
is also easily computed and seen to converge to one. Furthermore, it is easily
checked that the total mass of the limiting cdf displayed in part 2 is one. Scheffé's
Lemma then shows that $F_{2,n,\theta_n} + F_{3,n,\theta_n}$, and hence $F_{SCAD,n,\theta_n}$, converge in
total variation to the limit cdf given in part 2.

Next, assume that $\zeta > a$. Then the integrand in the formula for $F_{3,n,\theta_n}(x)$
converges to the density $\phi(z)$ for each $z$. The Dominated Convergence Theorem
then establishes the convergence of $F_{3,n,\theta_n}$, and hence of $F_{SCAD,n,\theta_n}$, to $\Phi$ in
total variation distance.

For $\zeta < 0$, the proof is, mutatis mutandis, the same with $f_{-1}$, $f_{-2}$, and
$f_{-3}$ now taking the roles of $f_1$, $f_2$, and $f_3$, respectively, and with the case
$-a < \zeta \le -1$ now being handled by showing that $1 - F_{SCAD,n,\theta_n}(x) \to 1$
for each $x \in \mathbb{R}$. Alternatively, it can be reduced to what has already been
established by observing that $F_{SCAD,n,\theta_n}(x) = 1 - F_{SCAD,n,-\theta_n}(-x-)$, where
$F_{SCAD,n,-\theta_n}(\cdot\,-)$ denotes the limit from the left of $F_{SCAD,n,-\theta_n}$ at the indicated
argument. ■

The fixed-parameter asymptotic distribution of the SCAD estimator,
including the 'oracle' property discussed at the beginning of this section, can clearly
be recovered from Theorem 11 by setting $\theta_n = \theta$. As in the case of the
hard-thresholding estimator, Theorem 11 shows that the asymptotic behavior
of the SCAD-estimator is much more complicated than what the 'oracle'
property predicts. In particular, Theorem 11 shows that the SCAD-estimator is not
uniformly $n^{1/2}$-consistent. [For a discussion of the behavior of this estimator
under a different scaling see the next paragraph as well as the appendix.]
Furthermore, since the finite-sample distribution of the SCAD-estimator is highly
non-normal but the pointwise asymptotic distribution is normal, the latter
cannot adequately capture many of the essential features of the former. In
contrast, the asymptotic distributions given in Theorem 11 are non-normal in some
cases. All this again shows that the 'oracle' property is more of an artifact of
the asymptotic framework than of much statistical significance.

The observation that the estimators $\hat\theta_H$ and $\hat\theta_{SCAD}$ are not uniformly
$n^{1/2}$-consistent if tuned to perform consistent model selection prompts the question
of the behavior of $c_n(\hat\theta_H - \theta)$ and $c_n(\hat\theta_{SCAD} - \theta)$ under a sequence of norming
constants $c_n$ that are $o(n^{1/2})$. Since both estimators are pointwise
$n^{1/2}$-consistent, it follows that the pointwise limiting distributions of $c_n(\hat\theta_H - \theta)$
and $c_n(\hat\theta_{SCAD} - \theta)$ will then degenerate to pointmass at zero. Furthermore, it
is not difficult to see that under general 'moving parameter' asymptotics the
finite-sample distributions of $c_n(\hat\theta_H - \theta_n)$ and $c_n(\hat\theta_{SCAD} - \theta_n)$ are then
nevertheless stochastically unbounded for certain sequences of parameters $\theta_n$ unless
$c_n = O(\eta_n^{-1})$. If $c_n = O(\eta_n^{-1})$, Theorem 2 has shown that $c_n(\hat\theta_H - \theta_n)$ and
$c_n(\hat\theta_{SCAD} - \theta_n)$ are indeed stochastically bounded. Hence, the uniform
convergence rate of $\hat\theta_H$ and $\hat\theta_{SCAD}$ is seen to be given precisely by $\eta_n$. The precise
limit distributions of these estimators under a scaling by $c_n$ can be obtained in
a manner similar to the above theorems and are given in Theorems 17 and 19 in
the appendix for the (only interesting) case $c_n = \eta_n^{-1}$. It turns out that the limit
distributions under 'moving parameter' asymptotics are always given by a
linear combination of at most two pointmasses, each located in the interval $[-1, 1]$.
With regard to the soft-thresholding estimator we have already observed that it
is not even pointwise $n^{1/2}$-consistent. Even the distributions of $c_n(\hat\theta_S - \theta)$ with
$c_n = o(n^{1/2})$ are stochastically unbounded if $\theta \neq 0$ unless $c_n = O(\eta_n^{-1})$. This is
most easily seen by using the relation to the hard-thresholding estimator given
in (2). If $c_n = O(\eta_n^{-1})$, relation (2) also shows that $c_n(\hat\theta_S - \theta)$ is stochastically
bounded, but has a degenerate (pointwise) limiting distribution. This has been
noted by Zou (2006). In view of Theorem 2, under this condition on $c_n$ the
distributions of $c_n(\hat\theta_S - \theta_n)$ are in fact stochastically bounded for any sequence
$\theta_n$. The precise forms of the possible limit distributions under such a 'moving
parameter' asymptotic are given in Theorem 18 in the appendix.

Theorems 9 and 11 demonstrate that the 'oracle' property of the
hard-thresholding estimator and of the SCAD-estimator paints a misleading picture
of the actual finite-sample behavior of these estimators due to nonuniformity
problems. In order to rescue the 'oracle' property, sometimes the argument is
put forward that parameter sequences $\theta_n$ that are responsible for the
nonuniformity problem should be eliminated from the parameter space a priori, since
such $\theta_n$ are supposedly close to zero and hence are difficult to distinguish
statistically from zero. While we think that such a reasoning is not sensible
(because asymptotic properties of statistical procedures that are quite unstable
under local perturbations of the parameter are highly suspect), we next show
that the suggested reasoning actually is flawed: Consider first the consistently
tuned SCAD-estimator. Suppose one considers a priori the restricted parameter
space $\Theta_n$ of the form $\Theta_n = \{\theta : \theta = 0 \text{ or } |\theta| > b_n\}$ for some sequence $b_n > 0$.
In order to achieve that for every $\theta_n \in \Theta_n$ with $\theta_n \neq 0$, the distribution of
$n^{1/2}(\hat\theta_{SCAD} - \theta_n)$ converges weakly to the standard normal $N(0, 1)$ (as desired
when attempting to rescue the 'oracle' property), it follows from Theorem 11
that $b_n$ would have to satisfy $n^{1/2}\eta_n(a - b_n/\eta_n) \to -\infty$ (e.g., $b_n = b\eta_n$ with
$b > a$). But then it is easy to see that the 'forbidden' set $\mathbb{R} \setminus \Theta_n$ contains
elements $\theta_n$ that are large in the sense that (i) they are of order larger than $n^{-1/2}$
and (ii) they are classified as nonzero with probability converging to unity by
the very same SCAD-procedure, i.e., $P_{n,\theta_n}(\hat\theta_{SCAD} \neq 0) \to 1$ holds (to see this
use Proposition 1). On top of this, the parameter space $\Theta_n$ is highly
artificial, depends on sample size, and also on the tuning parameter $\eta_n$ and thus
on the estimation procedure used. An analogous statement holds for the
hard-thresholding estimator (with the exception that the 'forbidden' set in this case
contains $\theta_n$ that are large in the sense that they satisfy (i) above and (ii) are
classified as non-zero with probability tending to unity by any conservatively
tuned hard-thresholding procedure). Taken together, this shows that adopting
a parameter space like $\Theta_n$ rules out values of $\theta$ that are substantially large,
and not only values of $\theta$ that are statistically difficult to distinguish from zero.
Hence, there seems to be little support for adopting such $\Theta_n$ as the parameter
space.

Remark 12 The theorems in this subsection actually completely describe the
limiting behavior of the finite-sample distributions of $\hat\theta_H$, $\hat\theta_S$, and $\hat\theta_{SCAD}$
without any condition on the sequence of parameters $\theta_n$. To see this, just apply the
theorems to subsequences and note that by compactness of $\mathbb{R} \cup \{-\infty, \infty\}$ we
can select from each subsequence a further subsequence such that the relevant
quantities like $n^{1/2}\theta_n$, $\theta_n/\eta_n$, $n^{1/2}(\eta_n - \theta_n)$, $n^{1/2}(\eta_n + \theta_n)$, etc. converge in
$\mathbb{R} \cup \{-\infty, \infty\}$ along this further subsequence.

Remark 13 (i) As a point of interest we note that the full complexity of the
possible limiting distributions in Theorems 9, 10, and 11 already arises if we
restrict the sequences $\theta_n$ to a bounded neighborhood of zero. Hence, the
phenomena described by these theorems are of a local nature, and are not tied in
any way to the unboundedness of the parameter space.

(ii) It is also interesting to observe that what governs the different cases
in Theorems 9 and 11 is essentially the behavior of $\theta_n/\eta_n$, which is of smaller
order than $n^{1/2}\theta_n$ because $n^{1/2}\eta_n \to \infty$ in the consistent case. Hence, an
analysis relying on the usual local asymptotics based on perturbations of $\theta$ of
the order of $n^{-1/2}$ does not properly reveal all possible limits of the
finite-sample distributions in the case where the estimators perform consistent model
selection.



Remark 14 As in Section 5.2.1, the mathematical reason for the failure
of the pointwise asymptotic distribution to capture the behavior of the
finite-sample distribution well is that the convergence of the latter to the former is
not uniform in the underlying parameter $\theta$. See Leeb and Pötscher (2003, 2005)
for more discussion in the context of post-model-selection estimators.



6 Impossibility results for estimating the distribution of $\hat\theta_H$, $\hat\theta_S$, and $\hat\theta_{SCAD}$

As shown in Section 5.1, the cdfs $F_{H,n,\theta}$, $F_{S,n,\theta}$, and $F_{SCAD,n,\theta}$ of the (centered
and scaled) estimators $\hat\theta_H$, $\hat\theta_S$, and $\hat\theta_{SCAD}$ depend on the unknown parameter
$\theta$ in a complicated manner. It is hence of interest to consider estimation of
these cdfs. We show that this is an intrinsically difficult estimation problem in
the sense that these cdfs cannot be estimated in a uniformly consistent
fashion. Parts of the results that follow have been presented in earlier work (in
slightly different settings): For a general class of post-model-selection
estimators including the hard-thresholding estimator, this phenomenon was discussed
in Leeb and Pötscher (2006b, 2008b) for the case where the estimator is tuned
to be conservative, whereas Leeb and Pötscher (2006a) consider the case where
the hard-thresholding estimator is tuned to be consistent; the latter paper also
gives similar results for a soft-thresholding estimator tuned to be conservative.
In the following, we give a simple unified treatment of hard-thresholding,
soft-thresholding, and also of the SCAD estimator. For the SCAD estimator and
for the consistently tuned soft-thresholding estimator, such non-uniformity
phenomena in estimating the estimator's cdf have not been established before. We
provide large-sample results that cover both consistent and conservative choices
of the tuning parameter, as well as finite-sample results that hold for any choice
of tuning parameter.

It is straightforward to construct consistent estimators for the distributions
of the (centered and scaled) estimators $\hat\theta_H$, $\hat\theta_S$, and $\hat\theta_{SCAD}$. One popular choice
is to use subsampling or the m out of n bootstrap with $m/n \to 0$. Another
possibility is to use the pointwise large-sample limit distributions derived in
Section 5.2 together with a properly chosen pre-test of the hypothesis $\theta = 0$
versus $\theta \neq 0$: Because the pointwise large-sample limit distribution takes only
two different functional forms depending on whether $\theta = 0$ or $\theta \neq 0$, one can
perform a pre-test that rejects the hypothesis $\theta = 0$ in case $|\bar y| > n^{-1/4}$, say,
and estimate the finite-sample distribution by that large-sample limit formula
that corresponds to the outcome of the pre-test;⁹ the test's critical value $n^{-1/4}$
ensures that the correct large-sample limit formula is selected with probability
approaching one as sample size increases.

9 In the conservative case, the asymptotic distribution can also depend on $e$, which is then
to be replaced by $n^{1/2}\eta_n$.
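
The pre-test construction can be sketched in a few lines for the conservatively tuned hard-thresholding estimator. The sketch uses the two pointwise limits derived in Section 5.2.1 (the mixed distribution with an atom at zero for $\theta = 0$, and $N(0, 1)$ for $\theta \neq 0$) and replaces $e$ by $n^{1/2}\eta_n$. The function names and the numerical inputs are hypothetical, chosen only for illustration.

```python
import math


def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def limit_cdf_theta_zero(t, e):
    """Cdf at t of the pointwise limit for theta = 0:
    atom of weight Phi(e) - Phi(-e) at 0 plus the N(0,1) density outside [-e, e]."""
    ac_part = Phi(t) - max(0.0, Phi(min(t, e)) - Phi(-e))
    atom = (Phi(e) - Phi(-e)) if t >= 0.0 else 0.0
    return ac_part + atom


def pretest_cdf_estimate(t, ybar, n, eta):
    """Pre-test estimate of the cdf of n^{1/2}(theta_hat_H - theta) at t."""
    if abs(ybar) > n ** -0.25:  # pre-test rejects theta = 0
        return Phi(t)           # pointwise limit for theta != 0 is N(0, 1)
    # Pre-test accepts theta = 0; e is replaced by n^{1/2} * eta_n.
    return limit_cdf_theta_zero(t, math.sqrt(n) * eta)


if __name__ == "__main__":
    n, eta = 10_000, 0.02  # hypothetical values with n^{1/2} * eta = 2
    print(pretest_cdf_estimate(0.5, ybar=0.0, n=n, eta=eta))
    print(pretest_cdf_estimate(0.5, ybar=0.5, n=n, eta=eta))
```

For true parameters of order $n^{-1/2}$ the pre-test accepts $\theta = 0$ with probability approaching one, so the estimate uses the $\theta = 0$ formula although the finite-sample cdf has its atom at $-n^{1/2}\theta$ rather than at zero.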

When estimating the distribution of thresholding (and related) estimators, 
there is evidence in the literature that certain specific consistent estimation 
procedures, like those sketched above, may not perform well in a worst-case 
scenario. For some examples, see Kulperger and Ahmed (1992); the disclaimer 
issued after Corollary 2.1 in Beran (1997); the discussion at the end of Section 4 
in Knight and Fu (2000); or Samworth (2003). The next result shows that 
this problem is not caused by the specifics of the consistent estimators under 
consideration but is an intrinsic feature of the estimation problem itself. 

Theorem 15 Let $\hat\theta$ denote any one of the estimators $\hat\theta_H$, $\hat\theta_S$, or $\hat\theta_{SCAD}$, and
write $F_{n,\theta}$ for the cdf of $n^{1/2}(\hat\theta - \theta)$ under $P_{n,\theta}$. Consider a sequence of tuning
parameters such that $\eta_n \to 0$ and $n^{1/2}\eta_n \to e$ as $n \to \infty$ with $0 \le e \le \infty$. Let
$t \in \mathbb{R}$ be arbitrary. Then every consistent estimator $\hat F_n(t)$ of $F_{n,\theta}(t)$ satisfies

\[
\lim_{n \to \infty}\ \sup_{|\theta| < c/n^{1/2}} P_{n,\theta}\Big( \big|\hat F_n(t) - F_{n,\theta}(t)\big| > \varepsilon \Big) = 1
\]

for each $\varepsilon < (\Phi(t+e) - \Phi(t-e))/2$ and each $c > |t|$. In particular, no uniformly
consistent estimator for $F_{n,\theta}(t)$ exists.

Proof. For two sequences $\theta_n^{(1)}$ and $\theta_n^{(2)}$ satisfying $|\theta_n^{(i)}| < c/n^{1/2}$, $i = 1,2$, the
probability measures $P_{n,\theta_n^{(1)}}$ and $P_{n,\theta_n^{(2)}}$ are mutually contiguous as is elementary
to verify (cf., e.g., Lemma A.1 of Leeb and Pötscher (2006a)). The corresponding
estimands $F_{n,\theta_n^{(1)}}(t)$ and $F_{n,\theta_n^{(2)}}(t)$, however, do not necessarily get close to each
other: For each $\delta$ write $\theta_n(\delta)$ as shorthand for $\theta_n(\delta) = -(t+\delta)/n^{1/2}$. The cdfs
$F_{n,\theta_n(\delta)}(\cdot)$ and $F_{n,\theta_n(-\delta)}(\cdot)$ have a jump at $t+\delta$ and at $t-\delta$, respectively, so
that for $\delta > 0$

$$F_{n,\theta_n(-\delta)}(t) - F_{n,\theta_n(\delta)}(t) = \Phi(t-\delta+n^{1/2}\eta_n) - \Phi(t-\delta-n^{1/2}\eta_n) + r(\delta); \qquad (11)$$

cf. (4), (5), and (7) for $\hat\theta_H$, $\hat\theta_S$, and $\hat\theta_{SCAD}$, respectively. Moreover, $r(\delta)$
goes to zero with $\delta \to 0$, because the absolutely continuous part of $F_{n,\theta}(t)$
is a continuous function of $\theta$ (again in view of the finite-sample formulae and
dominated convergence). Taking the supremum of $|F_{n,\theta_n(-\delta)}(t) - F_{n,\theta_n(\delta)}(t)|$
over all $\delta$ with $0 < \delta < c - |t|$, we obtain that this supremum is bounded from
below by $\Phi(t + n^{1/2}\eta_n) - \Phi(t - n^{1/2}\eta_n)$. [To see this note that this supremum
is not less than $\lim_{i\to\infty} |F_{n,\theta_n(-1/i)}(t) - F_{n,\theta_n(1/i)}(t)|$ and use (11).] Because
that lower bound converges to $\Phi(t+e) - \Phi(t-e)$ as $n \to \infty$, the theorem now
follows from Lemma 3.1 of Leeb and Pötscher (2006a). [Use this result with the
identifications $\beta = \theta$, $\varphi_n(\beta) = F_{n,\theta}(t)$, $B_n = \{\theta : |\theta| < c/n^{1/2}\}$, $\alpha = 0$, and
with $d(a,b) = |a - b|$. Moreover, note that $B_n$ contains $\theta_n(\delta)$ and $\theta_n(-\delta)$ for
$0 < \delta < c - |t|$.] ■
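As a numerical illustration of the jump in (11), the following Python sketch evaluates the finite-sample cdf of the hard-thresholding estimator at $\theta_n(\delta)$ and $\theta_n(-\delta)$. It assumes the Gaussian location model with $\bar y \sim N(\theta, 1/n)$ and $\hat\theta_H = \bar y\,1(|\bar y| > \eta_n)$; this model is not restated in the present section, and the function names and parameter values are purely illustrative.

```python
import math

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def hard_cdf(x, theta, n, eta):
    """cdf of n^{1/2}(theta_hat_H - theta) at x, assuming ybar ~ N(theta, 1/n)
    and theta_hat_H = ybar * 1(|ybar| > eta) (assumed model)."""
    rn = math.sqrt(n)
    # Z = n^{1/2}(ybar - theta) lies in [lo, hi] exactly when theta_hat_H = 0.
    lo, hi = rn * (-eta - theta), rn * (eta - theta)
    # Point mass located at -n^{1/2} * theta.
    atom = (Phi(hi) - Phi(lo)) if x >= -rn * theta else 0.0
    # Absolutely continuous part: P(Z <= x, Z outside [lo, hi]).
    cont = Phi(min(x, lo)) + max(0.0, Phi(x) - Phi(hi))
    return atom + cont

def cdf_gap(t, delta, n, eta):
    """F_{n,theta_n(-delta)}(t) - F_{n,theta_n(delta)}(t) with theta_n(d) = -(t + d)/n^{1/2}."""
    rn = math.sqrt(n)
    return hard_cdf(t, -(t - delta) / rn, n, eta) - hard_cdf(t, -(t + delta) / rn, n, eta)

n, eta, t = 100, 0.3, 0.5
target = Phi(t + math.sqrt(n) * eta) - Phi(t - math.sqrt(n) * eta)
for delta in (0.5, 0.1, 1e-3, 1e-6):
    print(delta, cdf_gap(t, delta, n, eta), target)
```

Under these assumptions the gap between the two estimands approaches $\Phi(t + n^{1/2}\eta_n) - \Phi(t - n^{1/2}\eta_n)$ as $\delta$ shrinks, although the two parameter values themselves become indistinguishable.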

We stress that the above result also applies to any kind of bootstrap- or
subsampling-based estimator of the cdf $F_{n,\theta}$ whatsoever, since the results in
Leeb and Pötscher (2006a) on which the proof of Theorem 15 rests apply to
arbitrary randomized estimators (cf. Lemma 3.6 in Leeb and Pötscher (2006a));
the same applies to Theorem 16 that follows as well as to Theorem 20 in the
appendix.

Loosely speaking, Theorem 15 states that any consistent estimator for the
cdf of interest suffers from an unavoidable worst-case error of at least $\varepsilon$ with
$\varepsilon < (\Phi(t+e) - \Phi(t-e))/2$. The error range, i.e., $(\Phi(t+e) - \Phi(t-e))/2$, is
governed by the limit $e = \lim_n n^{1/2}\eta_n$. In case the estimator $\hat\theta$ is tuned to be
consistent, i.e., in case $e = \infty$, the error range equals 1/2, and the phenomenon
is most pronounced. If the estimator $\hat\theta$ is tuned to be conservative so that
$e < \infty$, the error range is less than 1/2 but can still be substantial. Only in case
$e = 0$ the error range equals zero, and the condition $\varepsilon < (\Phi(t+e) - \Phi(t-e))/2$
in Theorem 15 leads to a trivial conclusion. This is, however, not surprising
as then the resulting estimator is uniformly asymptotically equivalent to the
unrestricted maximum likelihood estimator $\bar y$; cf. Remark 3.
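To give a feeling for the size of the error range $(\Phi(t+e) - \Phi(t-e))/2$, the following short Python sketch tabulates it for a few values of $e$ and $t$. The function names and the chosen grid are illustrative assumptions, not part of the paper.

```python
import math

def Phi(x):
    """Standard normal cdf; handles +/- infinity explicitly."""
    if math.isinf(x):
        return 1.0 if x > 0 else 0.0
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def error_range(t, e):
    """(Phi(t + e) - Phi(t - e)) / 2, the bound on epsilon in Theorem 15."""
    return 0.5 * (Phi(t + e) - Phi(t - e))

# e = 0: trivial case; 0 < e < infinity: conservative tuning; e = infinity: consistent tuning.
for e in (0.0, 0.5, 1.0, 2.0, math.inf):
    print(e, [round(error_range(t, e), 4) for t in (0.0, 1.0, 2.0)])
```

The range is zero for $e = 0$, increases with $e$, and equals 1/2 for every $t$ when $e = \infty$.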

A similar non-uniformity phenomenon as described in Theorem 15 for
consistent estimators $\hat F_n(t)$ also occurs for not necessarily consistent estimators.
For such arbitrary estimators, we find in the following that the phenomenon
can be somewhat less pronounced, in the sense that the lower bound is now
1/2 instead of 1; cf. (13) below. The following theorem gives a large-sample
limit result that parallels Theorem 15 as well as a finite-sample result, both for
arbitrary (and not necessarily consistent) estimators of the cdf.



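Theorem 16 Let $\hat\theta$ denote any one of the estimators $\hat\theta_H$, $\hat\theta_S$, or $\hat\theta_{SCAD}$, and
write $F_{n,\theta}$ for the cdf of $n^{1/2}(\hat\theta - \theta)$ under $P_{n,\theta}$. Let $0 < \eta_n < \infty$ and let $t \in \mathbb{R}$
be arbitrary. Then every estimator $\hat F_n(t)$ of $F_{n,\theta}(t)$ satisfies

$$\sup_{|\theta| < c/n^{1/2}} P_{n,\theta}\left( \left| \hat F_n(t) - F_{n,\theta}(t) \right| > \varepsilon \right) \ge \frac{1}{2} \qquad (12)$$

for each $\varepsilon < (\Phi(t + n^{1/2}\eta_n) - \Phi(t - n^{1/2}\eta_n))/2$, for each $c > |t|$, and for each
fixed sample size $n$. If $\eta_n$ satisfies $\eta_n \to 0$ and $n^{1/2}\eta_n \to e$ as $n \to \infty$ with
$0 \le e \le \infty$, we thus have

$$\liminf_{n\to\infty} \inf_{\hat F_n(t)} \sup_{|\theta| < c/n^{1/2}} P_{n,\theta}\left( \left| \hat F_n(t) - F_{n,\theta}(t) \right| > \varepsilon \right) \ge \frac{1}{2} \qquad (13)$$

for each $\varepsilon < (\Phi(t+e) - \Phi(t-e))/2$ and for each $c > |t|$, where the infimum in
(13) extends over all estimators $\hat F_n(t)$.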



Proof. Only the finite-sample statement needs to be proven. Let $\theta_n(\delta)$ be as in
the proof of Theorem 15. The total variation distance of $P_{n,\theta_n(\delta)}$ and $P_{n,\theta_n(-\delta)}$,
$\|P_{n,\theta_n(\delta)} - P_{n,\theta_n(-\delta)}\|_{TV}$, goes to zero as $\delta \to 0$ (which is easy to see, either
by direct computation or using, say, Lemma A.1 of Leeb and Pötscher (2006a)).
In view of (11), however, the estimands $F_{n,\theta_n(\delta)}(t)$ and $F_{n,\theta_n(-\delta)}(t)$ do not get
close to each other as $\delta \to 0$ ($\delta > 0$), as we have already seen in the proof of
Theorem 15. For each $\varepsilon$ that is smaller than $|F_{n,\theta_n(-\delta)}(t) - F_{n,\theta_n(\delta)}(t)|/2$, the
left-hand side of (12) is bounded from below by

$$\frac{1}{2}\left(1 - \|P_{n,\theta_n(\delta)} - P_{n,\theta_n(-\delta)}\|_{TV}\right).$$

This follows from Lemma 3.2 of Leeb and Pötscher (2006a) together with
Remark B.2 of that paper. [Use the result described in Remark B.2 with $A = \{n\}$,
$\beta = \theta$, $B_n = \{\theta_n(\delta), \theta_n(-\delta)\}$, $\varphi_n(\theta) = F_{n,\theta}(t)$, $d(a,b) = |a - b|$, and with
$\delta^*$ equal to $|F_{n,\theta_n(-\delta)}(t) - F_{n,\theta_n(\delta)}(t)|$. Moreover, note that $B_n$ is contained in
$\{\theta : |\theta| < c/n^{1/2}\}$ provided $0 < \delta < c - |t|$.] For $\delta \to 0$, now observe that the
expression in the preceding display converges to 1/2, i.e., the lower bound in (12),
and that $|F_{n,\theta_n(-\delta)}(t) - F_{n,\theta_n(\delta)}(t)|$ converges to $\Phi(t + n^{1/2}\eta_n) - \Phi(t - n^{1/2}\eta_n)$.
■
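The "direct computation" of the total variation distance mentioned in the proof can be sketched in Python as follows. The sketch assumes i.i.d. $N(\theta, 1)$ observations and the normalization $\|P - Q\|_{TV} = \sup_A |P(A) - Q(A)|$; both are assumptions not restated in this section, and the function names are illustrative. Under these assumptions the distance between $P_{n,\theta_n(\delta)}$ and $P_{n,\theta_n(-\delta)}$ reduces to that between two unit-variance normal distributions whose means differ by $2\delta$.

```python
import math

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def tv_distance(delta):
    """sup_A |P(A) - Q(A)| between P_{n,theta_n(delta)} and P_{n,theta_n(-delta)},
    assuming i.i.d. N(theta, 1) data (assumed model). By sufficiency this is the
    distance between N(-(t + delta), 1) and N(-(t - delta), 1), i.e. 2*Phi(delta) - 1;
    n and t cancel."""
    return 2.0 * Phi(abs(delta)) - 1.0

def lower_bound(delta):
    """(1 - TV) / 2, the lower bound for the left-hand side of (12)."""
    return 0.5 * (1.0 - tv_distance(delta))

for delta in (1.0, 0.1, 0.01, 0.001):
    print(delta, tv_distance(delta), lower_bound(delta))
```

As $\delta \to 0$ the distance vanishes and the lower bound tends to 1/2, which is the constant appearing in (12).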

Apart from being of interest in its own right, the asymptotic statement in
Theorem 16 also provides additional insight into some phenomena related to
inference based on shrinkage-type estimators that have recently attracted some
attention: When estimating the cdf of a hard-thresholding estimator, Samworth
(2003) noted that, while the bootstrap is not consistent, it nevertheless may
perform better, in a uniform sense, than the m out of n bootstrap which is consistent
(provided $m \to \infty$, $m/n \to 0$). Theorem 15 and the asymptotic statement in
Theorem 16 together show that this phenomenon of better performance of the
bootstrap is possible precisely because the bootstrap is not consistent.

The finite-sample statement in Theorem 16 clearly reveals how the
estimability of the cdf of the estimator depends on the tuning parameter $\eta_n$: A larger
value of $\eta_n$, which results in a 'more sparse' estimator in view of Q, directly
corresponds to a larger range $(\Phi(t + n^{1/2}\eta_n) - \Phi(t - n^{1/2}\eta_n))/2$ for the error $\varepsilon$
within which any estimator $\hat F_n(t)$ performs poorly in the sense of (12). In large
samples, the limit $e = \lim_n n^{1/2}\eta_n$ takes the role of $n^{1/2}\eta_n$.

An impossibility result paralleling Theorem 16 for the cdf of $\eta_n^{-1}(\hat\theta - \theta)$,
where $\hat\theta = \hat\theta_H$, $\hat\theta_S$, or $\hat\theta_{SCAD}$, is given in the appendix.



7 Conclusion 

We have studied the distribution of the LASSO, i.e., of a soft-thresholding esti- 
mator, of the SCAD, and of a hard-thresholding estimator in finite samples and 
in the large-sample limit. The finite-sample distributions of these estimators 
were found to be highly non-normal, because they are a mixture of a singular 
normal distribution and an absolutely continuous component that can be mul- 
timodal, for example. The large-sample behavior of these distributions depends 
on the choice of the estimators' tuning parameter where, in essence, two cases 
can occur: 

In the first case, the estimator can be viewed as performing conservative 
model selection. In this case, fixed-parameter asymptotics, where the true
parameters are held fixed while sample size increases, reflect the large-sample
behavior only in part. 'Moving parameter' asymptotics, where the true parameter
may depend on sample size, give a more complete picture. We have seen that
the distribution of the LASSO, of the SCAD, and of the hard-thresholding esti- 
mator can be highly non-normal irrespective of sample size, in particular in the 
statistically interesting case where the true parameter is close (in an appropriate 
sense) to a lower-dimensional submodel. This also shows that the finite-sample 
phenomena that we have observed are not small-sample effects but can occur 
at any sample size. 

In the second case, the estimator can be viewed as performing consistent 
model selection, and the hard-thresholding as well as the SCAD estimator have 
the 'oracle' property in the sense of Fan and Li (2001). [This is not so for the 
LASSO.] This 'oracle' property, which is based on fixed-parameter asymptotics, 
seems to suggest that the estimator in question performs very well in large sam- 
ples. However, as before, fixed-parameter asymptotics do not capture the whole 
range of large-sample phenomena that can occur. With 'moving parameter' 
asymptotics, we have shown that the distribution of these estimators can again 
be highly non-normal, even in large samples. In addition, we have found that 
the observed finite-sample phenomena not only can persist but actually can be 
more pronounced for larger sample sizes. For example, the distribution of the 
SCAD estimator can diverge in the sense that all its mass escapes to either $+\infty$
or $-\infty$.

We have also demonstrated that the LASSO, the SCAD, and the hard- 
thresholding estimator are always uniformly consistent, irrespective of the choice 
of tuning parameter (except for non-sensible choices). In case the tuning is such
that the estimator acts as a conservative model selector, we have also seen that
these estimators are in fact uniformly $n^{1/2}$-consistent. However, uniform
$n^{1/2}$-consistency no longer holds in the case where the estimator acts like a consistent
model selector (and where the SCAD and the hard-thresholding estimator have
the 'oracle' property). In fact, the estimators then have a uniform convergence
rate slower than $n^{-1/2}$ in that they are only uniformly $\eta_n^{-1}$-consistent. The
asymptotic distributions of the estimators under an $\eta_n^{-1}$-scaling, rather than an
$n^{1/2}$-scaling, are discussed in the appendix.

Finally, we have studied the problem of estimating the cdf of the (centered 
and scaled) LASSO, SCAD, and hard-thresholding estimator. We have shown 
that this cdf can not be estimated in a uniformly consistent fashion, even though 
pointwise consistent estimators can be constructed with relative ease. Moreover, 
we have obtained performance bounds for estimators of the cdf that suggest that 
inconsistent estimators for this cdf may actually perform better, in a uniform 
sense, than consistent estimators. 

The phenomena observed here for distributional properties of the estimators 
under consideration not surprisingly spill over to the estimators' risk behavior. 
The finite-sample distributions derived in this paper in fact facilitate a detailed 
risk analysis, but this is not our main focus here. Therefore, we only point out 
the most important risk phenomena: We consider squared error loss scaled by
sample size (i.e., $L(\hat\theta, \theta) = n(\hat\theta - \theta)^2$), and we shall compare the estimators to the
maximum-likelihood estimator based on the overall model, i.e., $\hat\theta_U = \bar y$. In finite
samples, the LASSO, the SCAD, and the hard-thresholding estimator compare
favorably with $\hat\theta_U$ in terms of risk, if the true parameter is in a neighborhood
of the lower dimensional model; outside of that neighborhood, the situation is 
reversed. [This is well-known for the hard- and soft-thresholding estimators and 
for more general pre-test estimators; cf. Judge and Bock (1978), Bruce and Gao 
(1996). Explicit formulae for the risk of a hard-thresholding estimator are also 
given in Leeb and Pötscher (2005).] As sample size goes to infinity, again two
cases need to be distinguished: If these estimators are tuned to perform conser- 
vative model selection, the worst-case risk of the LASSO, of the SCAD, and of 
the hard-thresholding estimator remains bounded as sample size increases. If 
the tuning is such that these estimators perform consistent model selection (the 
case when the SCAD as well as the hard-thresholding estimator have the 'oracle' 
property), then the worst-case risk of these estimators increases indefinitely as
sample size goes to infinity. [In fact, this is true for any estimator that has a
'sparsity' property; see Theorem 2.1 in Leeb and Pötscher (2008a) for details.]
Thus for these estimators the asymptotic worst-case risk behavior is in marked
contrast to their favorable pointwise asymptotic risk behavior reflected in the
'oracle' property. For the SCAD, the LASSO, and for the hard-thresholding
estimator, this worst-case risk behavior is also in line with the fact that these
estimators are uniformly $n^{1/2}$-consistent if tuned to perform conservative model
selection, but that uniform $n^{1/2}$-consistency breaks down when they are tuned
to perform consistent model selection. 
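The contrast between the two tuning regimes can be illustrated numerically for the hard-thresholding estimator. The Python sketch below assumes the Gaussian location model with $\bar y \sim N(\theta, 1/n)$ and $\hat\theta_H = \bar y\,1(|\bar y| > \eta_n)$ (not restated in this section); the closed-form risk expression is derived here under that assumption and is not quoted from the paper. The tunings $\eta_n = n^{-1/2}$ and $\eta_n = n^{-1/4}$, the grid, and the function names are illustrative choices. Under the assumed model the scaled risk of $\bar y$ equals 1.

```python
import math

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def hard_risk(theta, n, eta):
    """n * E(theta_hat_H - theta)^2, assuming ybar ~ N(theta, 1/n) and
    theta_hat_H = ybar * 1(|ybar| > eta) (assumed model)."""
    rn = math.sqrt(n)
    # Z = n^{1/2}(ybar - theta) lies in [lo, hi] exactly when theta_hat_H = 0.
    lo, hi = rn * (-eta - theta), rn * (eta - theta)
    p0 = Phi(hi) - Phi(lo)
    # E[Z^2 1(lo <= Z <= hi)], using d/dz(-z*phi(z)) = (z^2 - 1)*phi(z).
    z2_inside = p0 - (hi * phi(hi) - lo * phi(lo))
    # Loss is Z^2 outside [lo, hi] and n*theta^2 inside.
    return (1.0 - z2_inside) + n * theta * theta * p0

def worst_case_risk(n, eta, grid=2000):
    """Maximum risk over a grid of theta in [0, 3*eta + 5/n^{1/2}] (risk is symmetric in theta)."""
    top = 3.0 * eta + 5.0 / math.sqrt(n)
    return max(hard_risk(top * k / grid, n, eta) for k in range(grid + 1))

for n in (10**2, 10**4, 10**6):
    conservative = worst_case_risk(n, n ** -0.5)   # n^{1/2} * eta_n = 1
    consistent = worst_case_risk(n, n ** -0.25)    # n^{1/2} * eta_n -> infinity
    print(n, conservative, consistent)
```

In this sketch the grid maximum stays constant in $n$ under the conservative tuning and grows with $n$ under the consistent tuning, while the risk at $\theta = 0$ falls below 1.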

Finally we want to stress that our results should not be read as a criticism of 
penalized maximum likelihood estimators per se, but rather as a warning that 
the distributional properties of such estimators are more intricate and complex 
than might appear at first glance. 
A Appendix

For the case where the estimators $\hat\theta_H$, $\hat\theta_S$, and $\hat\theta_{SCAD}$ are tuned to perform
consistent model selection (i.e., $\eta_n \to 0$ and $n^{1/2}\eta_n \to \infty$), we now consider the
possible limits of the distributions of $c_n(\hat\theta_H - \theta_n)$, $c_n(\hat\theta_S - \theta_n)$, and
$c_n(\hat\theta_{SCAD} - \theta_n)$ when $c_n = O(\eta_n^{-1})$. The only interesting case is where $c_n \sim \eta_n^{-1}$, since
for $c_n = o(\eta_n^{-1})$ these limits are always pointmass at zero in view of Theorem
2.$^{10}$ Let $G_{H,n,\theta}$, $G_{S,n,\theta}$, and $G_{SCAD,n,\theta}$ stand for the finite-sample distributions
of $\eta_n^{-1}(\hat\theta_H - \theta)$, $\eta_n^{-1}(\hat\theta_S - \theta)$, and $\eta_n^{-1}(\hat\theta_{SCAD} - \theta)$, respectively, under $P_{n,\theta}$.
Clearly, $G_{H,n,\theta}(x) = F_{H,n,\theta}(n^{1/2}\eta_n x)$ and similar relations hold for $G_{S,n,\theta}$ and
$G_{SCAD,n,\theta}$. We next provide the limits of these distributions under 'moving
parameter' asymptotics. Note that comments like in Remarks 12, 13, and 14
also apply to the three subsequent theorems.

10 There is no loss in generality here in the sense that the general case where $c_n = O(\eta_n^{-1})$
holds can - by passing to subsequences - always be reduced to the cases where $c_n \sim \eta_n^{-1}$ or
$c_n = o(\eta_n^{-1})$ holds.

Theorem 17 Consider the hard-thresholding estimator with $\eta_n \to 0$ and
$n^{1/2}\eta_n \to \infty$. Assume that $\theta_n/\eta_n \to \zeta$ for some $\zeta \in \mathbb{R} \cup \{-\infty, \infty\}$.

1. If $|\zeta| < 1$, then $G_{H,n,\theta_n}$ converges weakly to pointmass $\delta_{-\zeta}$.

2. If $|\zeta| = 1$ and $n^{1/2}(\eta_n - \zeta\theta_n) \to r$ for some $r \in \mathbb{R} \cup \{-\infty, \infty\}$, then
$G_{H,n,\theta_n}$ converges weakly to

$$\Phi(r)\delta_{-\zeta} + (1 - \Phi(r))\delta_0.$$

3. If $1 < |\zeta| \le \infty$, then $G_{H,n,\theta_n}$ converges weakly to pointmass $\delta_0$.

Proof. Consider case 1 first. On the event $\{\hat\theta_H = 0\}$ we have $\eta_n^{-1}(\hat\theta_H - \theta_n) =
-\eta_n^{-1}\theta_n$. By Proposition 1, $P_{n,\theta_n}(\hat\theta_H = 0) \to 1$. Since $\eta_n^{-1}\theta_n \to \zeta$
by assumption, the result follows. To prove case 2 write $\eta_n^{-1}(\hat\theta_H - \theta_n)$ as
$-\eta_n^{-1}\theta_n 1(\hat\theta_H = 0) + (n^{1/2}\eta_n)^{-1} Z_n 1(\hat\theta_H \ne 0)$ where $Z_n$ is standard normally
distributed under $P_{n,\theta_n}$. Since Proposition 1 shows that $P_{n,\theta_n}(\hat\theta_H = 0) \to \Phi(r)$,
the result in case 2 now follows as is easily seen. To prove case 3, observe that
$\eta_n^{-1}(\hat\theta_H - \theta_n) = (n^{1/2}\eta_n)^{-1} n^{1/2}(\hat\theta_H - \theta_n)$ and that $n^{1/2}(\hat\theta_H - \theta_n)$ converges to
a standard normal distribution under $P_{n,\theta_n}$ in view of Theorem 9. ■
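The weight $\Phi(r)$ in case 2 can be checked numerically. The Python sketch below assumes the Gaussian location model with $\bar y \sim N(\theta, 1/n)$ and $\hat\theta_H = \bar y\,1(|\bar y| > \eta_n)$ (not restated in this section), so that $P_{n,\theta}(\hat\theta_H = 0) = \Phi(n^{1/2}(\eta_n - \theta)) - \Phi(n^{1/2}(-\eta_n - \theta))$. The tuning $\eta_n = n^{-1/4}$, the value of $r$, and the function names are illustrative choices.

```python
import math

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_zero(theta, n, eta):
    """P(theta_hat_H = 0) = P(|ybar| <= eta) with ybar ~ N(theta, 1/n) (assumed model)."""
    rn = math.sqrt(n)
    return Phi(rn * (eta - theta)) - Phi(rn * (-eta - theta))

r = 0.7
for n in (10**2, 10**4, 10**6):
    eta = n ** -0.25                   # eta_n -> 0 and n^{1/2} * eta_n -> infinity
    theta_n = eta - r / math.sqrt(n)   # theta_n / eta_n -> 1 and n^{1/2}(eta_n - theta_n) = r
    print(n, prob_zero(theta_n, n, eta), Phi(r))
```

Along this sequence the probability of the event $\{\hat\theta_H = 0\}$ is close to $\Phi(r)$ already for moderate $n$, which is the mass placed at $-\zeta = -1$ in the limit.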

Theorem 18 Consider the soft-thresholding estimator with $\eta_n \to 0$ and
$n^{1/2}\eta_n \to \infty$. Assume that $\theta_n/\eta_n \to \zeta$ for some $\zeta \in \mathbb{R} \cup \{-\infty, \infty\}$. Then
$G_{S,n,\theta_n}$ converges weakly to pointmass $\delta_{-\mathrm{sign}(\zeta)\min(1,|\zeta|)}$.

Proof. From (5) we obtain that

$$G_{S,n,\theta_n}(x) = \Phi(n^{1/2}\eta_n(x+1))\,1(x \ge -\theta_n/\eta_n) + \Phi(n^{1/2}\eta_n(x-1))\,1(x < -\theta_n/\eta_n).$$

Now it is easy to see that this expression converges to 0 if $x <
-\mathrm{sign}(\zeta)\min(1,|\zeta|)$ and to 1 if $x > -\mathrm{sign}(\zeta)\min(1,|\zeta|)$. ■
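The convergence claimed in this proof is easy to inspect numerically. The Python sketch below evaluates the displayed expression for $G_{S,n,\theta_n}(x)$ on either side of the limiting pointmass; the tuning $\eta_n = n^{-1/4}$, the values of $\zeta$, and the function names are illustrative assumptions.

```python
import math

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def G_soft(x, theta, n, eta):
    """G_{S,n,theta}(x) as displayed in the proof of Theorem 18."""
    c = math.sqrt(n) * eta
    if x >= -theta / eta:
        return Phi(c * (x + 1.0))
    return Phi(c * (x - 1.0))

def limit_point(zeta):
    """Location -sign(zeta) * min(1, |zeta|) of the limiting pointmass."""
    return -math.copysign(1.0, zeta) * min(1.0, abs(zeta))

n = 10**8
eta = n ** -0.25                       # consistent tuning (illustrative choice)
for zeta in (0.5, 3.0, -0.5, -3.0):
    p = limit_point(zeta)
    theta_n = zeta * eta
    # cdf just below and just above the limiting pointmass
    print(zeta, p, G_soft(p - 0.1, theta_n, n, eta), G_soft(p + 0.1, theta_n, n, eta))
```

For each $\zeta$ the cdf is close to 0 slightly to the left of $-\mathrm{sign}(\zeta)\min(1,|\zeta|)$ and close to 1 slightly to the right of it.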

Theorem 19 Consider the SCAD estimator with $\eta_n \to 0$ and $n^{1/2}\eta_n \to \infty$.
Assume that $\theta_n/\eta_n \to \zeta$ for some $\zeta \in \mathbb{R} \cup \{-\infty, \infty\}$.

1. If $|\zeta| \le 2$, then $G_{SCAD,n,\theta_n}$ converges weakly to pointmass
$\delta_{-\mathrm{sign}(\zeta)\min(1,|\zeta|)}$.

2. If $2 < |\zeta| < a$, then $G_{SCAD,n,\theta_n}$ converges weakly to pointmass
$\delta_{-\mathrm{sign}(\zeta)(a-|\zeta|)/(a-2)}$.

3. If $a \le |\zeta| \le \infty$, then $G_{SCAD,n,\theta_n}$ converges weakly to pointmass $\delta_0$.
Proof. If $|\zeta| < 1$ the proof is identical to the proof of case 1 in Theorem 17.
Next assume $\zeta = 1$; assume also for the moment that $n^{1/2}(\eta_n - \theta_n) \to r$,
$r \in \mathbb{R} \cup \{-\infty, \infty\}$. The atomic part $G_{0,n,\theta_n}$ of the cdf $G_{SCAD,n,\theta_n}(x)$ is given
by $\{\Phi(n^{1/2}(-\theta_n + \eta_n)) - \Phi(n^{1/2}(-\theta_n - \eta_n))\}\,1(x \ge -\theta_n/\eta_n)$ which is seen to
converge weakly to $\Phi(r)1(x \ge -1)$ which is the cdf of $\Phi(r)\delta_{-1}$. Furthermore,
recalling the definition of $F_{1,n,\theta}$ given in the proof of Theorem 11,

$$G_{1,n,\theta_n}(x) = F_{1,n,\theta_n}(n^{1/2}\eta_n x) = \int_{-\infty}^{n^{1/2}\eta_n(x+1)} \phi(z)\, 1\left(n^{1/2}(\eta_n - \theta_n) < z \le n^{1/2}(2\eta_n - \theta_n)\right) dz$$

is seen to converge to 0 for $x < -1$ and to $1 - \Phi(r)$ for $x > -1$, since
$n^{1/2}(\eta_n - \theta_n) \to r$ and $n^{1/2}(2\eta_n - \theta_n) = n^{1/2}\eta_n(2 - \theta_n/\eta_n) \to \infty$. Hence,
$G_{1,n,\theta_n}$ converges weakly to $(1 - \Phi(r))\delta_{-1}$, and thus $G_{0,n,\theta_n} + G_{1,n,\theta_n}$ converges
weakly to pointmass $\delta_{-1}$. This implies that also $G_{SCAD,n,\theta_n}$ has the same
limit. Since the limit does not depend on $r$, a subsequence argument as in the
proof of Theorem 11 completes the proof of the case $\zeta = 1$. Next consider the
case $1 < \zeta < 2$: Here $G_{1,n,\theta_n}(x)$ is easily seen to converge to 0 for $x < -1$
and to 1 for $x > -1$, since $n^{1/2}(\eta_n - \theta_n) = n^{1/2}\eta_n(1 - \theta_n/\eta_n) \to -\infty$ and
$n^{1/2}(2\eta_n - \theta_n) = n^{1/2}\eta_n(2 - \theta_n/\eta_n) \to \infty$. Hence, $G_{1,n,\theta_n}$ converges weakly to
pointmass $\delta_{-1}$, and consequently $G_{SCAD,n,\theta_n}$ has to have the same limit. We
turn to the case $\zeta = 2$: Assume now for the moment that $n^{1/2}(2\eta_n - \theta_n) \to r$,
$r \in \mathbb{R} \cup \{-\infty, \infty\}$. Then $G_{1,n,\theta_n}(x)$ is seen to converge to 0 for $x < -1$
and to $\Phi(r)$ for $x > -1$, since $n^{1/2}(\eta_n - \theta_n) = n^{1/2}\eta_n(1 - \theta_n/\eta_n) \to -\infty$
and $n^{1/2}(2\eta_n - \theta_n) \to r$. Furthermore, note that in the case considered
$((a-2)n^{1/2}\eta_n x + n^{1/2}(a\eta_n - \theta_n))/(a-1)$ converges to $-\infty$ for $x < -1$ and to
$\infty$ for $x > -1$. Consequently,

$$G_{2,n,\theta_n}(x) = F_{2,n,\theta_n}(n^{1/2}\eta_n x) = \int_{-\infty}^{((a-2)n^{1/2}\eta_n x + n^{1/2}(a\eta_n - \theta_n))/(a-1)} \phi(z)\, 1\left(n^{1/2}(2\eta_n - \theta_n) < z \le n^{1/2}(a\eta_n - \theta_n)\right) dz \qquad (14)$$

is seen to converge to 0 for $x < -1$ and to $1 - \Phi(r)$ for $x > -1$, since
$n^{1/2}(2\eta_n - \theta_n) \to r$ and $n^{1/2}(a\eta_n - \theta_n) = n^{1/2}\eta_n(a - \theta_n/\eta_n) \to \infty$. But
this shows that $G_{1,n,\theta_n} + G_{2,n,\theta_n}$ converges weakly to pointmass $\delta_{-1}$, and hence
the same must be true for $G_{SCAD,n,\theta_n}$. Since the limit does not depend on
$r$, a subsequence argument completes the proof for the case $\zeta = 2$. Consider
next the case where $2 < \zeta < a$: Then $G_{2,n,\theta_n}(x)$ is easily seen to converge
to 0 if $x < -(a-\zeta)/(a-2)$ and to 1 if $x > -(a-\zeta)/(a-2)$, since
$((a-2)n^{1/2}\eta_n x + n^{1/2}(a\eta_n - \theta_n))/(a-1)$ converges to $-\infty$ or $\infty$ depending on
whether $x$ is smaller or larger than $-(a-\zeta)/(a-2)$, and since $n^{1/2}(2\eta_n - \theta_n) \to
-\infty$ and $n^{1/2}(a\eta_n - \theta_n) \to \infty$. This proves that $G_{2,n,\theta_n}$, and hence $G_{SCAD,n,\theta_n}$,
converges weakly to pointmass $\delta_{-(a-\zeta)/(a-2)}$. Assume next that $\zeta = a$ and
assume for the moment that $n^{1/2}(a\eta_n - \theta_n) \to r$, $r \in \mathbb{R} \cup \{-\infty, \infty\}$: Then the
upper limit in the integral defining $G_{2,n,\theta_n}$ converges to $\infty$ if $x > 0$ and to $-\infty$
if $x < 0$. This is obvious if $|r| < \infty$, and follows from rewriting the upper limit
as $n^{1/2}\eta_n((a-2)x + a - \theta_n/\eta_n)/(a-1)$ if $|r| = \infty$. Furthermore, the lower
limit in the indicator function in (14) converges to $-\infty$, while the upper limit
converges to $r$. This shows that $G_{2,n,\theta_n}$ converges weakly to $\Phi(r)\delta_0$. Inspection of

$$G_{3,n,\theta_n}(x) = F_{3,n,\theta_n}(n^{1/2}\eta_n x) = \int_{-\infty}^{n^{1/2}\eta_n x} \phi(z)\, 1\left(z > n^{1/2}(a\eta_n - \theta_n)\right) dz$$

shows that this converges weakly to $(1 - \Phi(r))\delta_0$. Together this gives weak
convergence of $G_{2,n,\theta_n} + G_{3,n,\theta_n}$, and hence of $G_{SCAD,n,\theta_n}$, to pointmass $\delta_0$. Since
the limit does not depend on $r$, a subsequence argument again completes the
proof of the case $\zeta = a$. Suppose next that $a < \zeta \le \infty$: Inspection of $G_{3,n,\theta_n}$
immediately shows that it (and hence also $G_{SCAD,n,\theta_n}$) converges weakly to
pointmass $\delta_0$. The remaining cases for $\zeta \le -1$ are proved completely analogously
to the corresponding cases with positive $\zeta$. ■
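The three regimes of Theorem 19 can also be examined by simulation. The Python sketch below assumes the Gaussian location model with $\bar y \sim N(\theta, 1/n)$ and the usual closed-form SCAD thresholding rule applied to $\bar y$ (neither is restated in this section, so both are assumptions). The value $a = 3.7$, the tuning $\eta_n = n^{-1/4}$, the tolerance, and all function names are illustrative choices.

```python
import math
import random

def scad(ybar, eta, a=3.7):
    """SCAD thresholding rule in its usual closed form (assumed here; requires a > 2)."""
    s = math.copysign(1.0, ybar)
    ay = abs(ybar)
    if ay <= 2.0 * eta:
        return s * max(ay - eta, 0.0)                         # soft-thresholding region
    if ay <= a * eta:
        return ((a - 1.0) * ybar - s * a * eta) / (a - 2.0)   # interpolation region
    return ybar                                               # no shrinkage

def scad_limit(zeta, a=3.7):
    """Location of the limiting pointmass stated in Theorem 19."""
    s, z = math.copysign(1.0, zeta), abs(zeta)
    if z <= 2.0:
        return -s * min(1.0, z)
    if z < a:
        return -s * (a - z) / (a - 2.0)
    return 0.0

def frac_near(zeta, n, a=3.7, tol=0.1, reps=20000, seed=0):
    """Monte Carlo fraction of draws of (theta_hat_SCAD - theta_n)/eta_n that fall
    within tol of the stated limit, with theta_n = zeta * eta_n."""
    rng = random.Random(seed)
    eta = n ** -0.25
    theta = zeta * eta
    target = scad_limit(zeta, a)
    hits = 0
    for _ in range(reps):
        ybar = theta + rng.gauss(0.0, 1.0) / math.sqrt(n)
        if abs((scad(ybar, eta, a) - theta) / eta - target) < tol:
            hits += 1
    return hits / reps

for zeta in (0.5, 1.5, 3.0, 5.0):
    print(zeta, scad_limit(zeta), frac_near(zeta, n=10**8))
```

With $n = 10^8$ essentially all simulated mass lies within the tolerance of the pointmass given by the theorem for each of the values of $\zeta$ tried here.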

Finally, we provide an impossibility result for the estimation of the finite 
sample distributions $G_{H,n,\theta}$, $G_{S,n,\theta}$, and $G_{SCAD,n,\theta}$.

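Theorem 20 Let $\hat\theta$ denote any one of the estimators $\hat\theta_H$, $\hat\theta_S$, or $\hat\theta_{SCAD}$, and
write $G_{n,\theta}$ for the cdf of $\eta_n^{-1}(\hat\theta - \theta)$ under $P_{n,\theta}$. Let $0 < \eta_n < \infty$ and let $t \in \mathbb{R}$
be arbitrary. Then every estimator $\hat G_n(t)$ of $G_{n,\theta}(t)$ satisfies

$$\sup_{|\theta| < c\eta_n} P_{n,\theta}\left( \left| \hat G_n(t) - G_{n,\theta}(t) \right| > \varepsilon \right) \ge \frac{1}{2} \qquad (15)$$

for each $\varepsilon < (\Phi(n^{1/2}\eta_n(t+1)) - \Phi(n^{1/2}\eta_n(t-1)))/2$, for each $c > |t|$, and for
each fixed sample size $n$. If $\eta_n$ satisfies $\eta_n \to 0$ and $n^{1/2}\eta_n \to \infty$ as $n \to \infty$,
we thus have for each $c > |t|$

$$\liminf_{n\to\infty} \inf_{\hat G_n(t)} \sup_{|\theta| < c\eta_n} P_{n,\theta}\left( \left| \hat G_n(t) - G_{n,\theta}(t) \right| > \varepsilon \right) \ge \frac{1}{2} \qquad (16)$$

for each $\varepsilon < 1/2$ if $|t| < 1$ and for $\varepsilon < 1/4$ if $|t| = 1$, where the infimum in (16)
extends over all estimators $\hat G_n(t)$.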

This result shows, in particular, that no uniformly consistent estimator exists
for $G_{n,\theta}(t)$ in case $|t| \le 1$ (not even over compact subsets of $\mathbb{R}$ containing the
origin). In view of Theorems 17, 18, and 19 we see that for $t > 1$ we have
$\sup_{\theta \in \mathbb{R}} |G_{n,\theta}(t) - 1| \to 0$ as $n \to \infty$, hence $\hat G_n(t) = 1$ is trivially a uniformly
consistent estimator. Similarly, for $t < -1$ we have $\sup_{\theta \in \mathbb{R}} |G_{n,\theta}(t)| \to 0$ as
$n \to \infty$, hence $\hat G_n(t) = 0$ is trivially a uniformly consistent estimator.



,1/2, 



Proof. We first prove (15). For fixed $n$ and $t$ set $s = n^{1/2}\eta_n t$. Define $\hat F_n(s) =
\hat G_n(t)$. Also note that $G_{n,\theta}(t) = F_{n,\theta}(s)$ holds. By Theorem 16 we know that

$$\sup_{|\theta| < d/n^{1/2}} P_{n,\theta}\left( \left| \hat F_n(s) - F_{n,\theta}(s) \right| > \varepsilon \right) \ge \frac{1}{2}$$

for each $\varepsilon < (\Phi(s + n^{1/2}\eta_n) - \Phi(s - n^{1/2}\eta_n))/2$ and for each $d > |s|$. Rewriting
this in terms of $t$, $\hat G_n(t)$, and $G_{n,\theta}(t)$ and setting $c = d n^{-1/2}/\eta_n$ gives (15).
Relation (16) is a trivial consequence of (15). ■
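The passage from (15) to (16) rests on the limit of the bound $(\Phi(n^{1/2}\eta_n(t+1)) - \Phi(n^{1/2}\eta_n(t-1)))/2$ as $n^{1/2}\eta_n \to \infty$. The following Python sketch tabulates this bound; the tuning $\eta_n = n^{-1/4}$, the values of $t$, and the function names are illustrative assumptions.

```python
import math

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def eps_bound(t, n, eta):
    """(Phi(n^{1/2} eta (t+1)) - Phi(n^{1/2} eta (t-1))) / 2, the bound on epsilon in (15)."""
    c = math.sqrt(n) * eta
    return 0.5 * (Phi(c * (t + 1.0)) - Phi(c * (t - 1.0)))

for n in (10**2, 10**4, 10**8):
    eta = n ** -0.25   # consistent tuning (illustrative choice)
    print(n, [round(eps_bound(t, n, eta), 4) for t in (0.0, 0.5, 1.0, 1.5)])
```

The bound tends to 1/2 for $|t| < 1$, to 1/4 for $|t| = 1$, and to 0 for $|t| > 1$, in line with the ranges of $\varepsilon$ stated after (16).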